Abstract

In this paper we develop a theory of matrix completion for the extreme case of noisy 1-bit observations.
Instead of observing a subset of the real-valued entries of a matrix $M$, we obtain a small number
of binary (1-bit) measurements generated according to a probability distribution determined by the
real-valued entries of $M$. The central question we ask is whether or not it is possible to obtain an accurate
estimate of $M$ from this data. In general this would seem impossible, but we show that the maximum
likelihood estimate under a suitable constraint returns an accurate estimate of $M$ when $\|M\|_\infty \le \alpha$ and
$\mathrm{rank}(M) \le r$. If the log-likelihood is a concave function (e.g., the logistic or probit observation models),
then we can obtain this maximum likelihood estimate by optimizing a convex program. In addition,
we also show that if instead of recovering $M$ we simply wish to obtain an estimate of the distribution
generating the 1-bit measurements, then we can eliminate the requirement that $\|M\|_\infty \le \alpha$. For both
cases, we provide lower bounds showing that these estimates are near-optimal.

1 Introduction 

The problem of recovering a matrix from an incomplete sampling of its entries, also known as matrix
completion, arises in a wide variety of practical situations. In many of these settings, however, the
observations are not only incomplete, but also highly quantized, often even to a single bit. In this paper we
consider a statistical model for such data where instead of observing a real-valued entry as in the original 
matrix completion problem, we are now only able to see a positive or negative rating. This binary output 
is generated according to a probability distribution which is parameterized by the corresponding entry of 
the unknown matrix M. The central question we ask in this paper is: "Given observations of this form, 
can we recover the underlying matrix?" 

We will see that $O(rd)$ measurements are sufficient to accurately recover a $d \times d$, rank-$r$ matrix from
such data. Before describing this result and others in more detail, we provide a brief review of the matrix 
completion problem and the closely related problem of 1-bit compressed sensing. 



1.1 Matrix completion 

Matrix completion arises in a wide variety of practical contexts, including collaborative filtering [17], system 
identification [32], sensor localization [3, 44, 45], rank aggregation [16], and many more. While many of 
these applications have a relatively long history, recent advances in the closely related field of compressed 
sensing [14, 7, 13] have enabled a burst of progress in the last few years, and we now have a strong base 
of theoretical results concerning matrix completion [19, 10, 11, 24, 25, 36, 29, 9, 41, 42, 27, 15, 26, 28]. 
A typical result from this literature is that a generic $d \times d$ matrix of rank $r$ can be exactly recovered
from $O(rd\,\mathrm{polylog}(d))$ randomly chosen entries. Similar results can be established in the case of noisy
observations and approximately low-rank matrices [25, 36, 29, 9, 42, 27, 15, 26, 28]. 

Although these results are quite impressive, there is an important gap between the statement of the 
problem as considered in the matrix completion literature and many of the most common applications 
discussed therein. As an example, consider collaborative filtering and the now-famous "Netflix problem." 
In this setting, we assume that there is some unknown matrix whose entries each represent a rating for a 
particular user on a particular movie. Since any user will rate only a small subset of possible movies, we are 
only able to observe a small fraction of the total entries in the matrix, and our goal is to infer the unseen 
ratings from the observed ones. If the rating matrix is low-rank, then this would seem to be the exact 
problem studied in the matrix completion literature. However, there is a subtle difference: the theory 
developed in this literature generally assumes that observations consist of (possibly noisy) continuous- 
valued entries of the matrix, whereas in the Netflix problem the observations are "quantized" to the set of 
integers between 1 and 5. If we believe that it is possible for a user's true rating for a particular movie to 
be, for example, 4.5, then we must account for the impact of this "quantization noise" on our recovery. Of 
course, one could potentially treat quantization simply as a form of bounded noise, but this is somewhat 
unsatisfying because the ratings aren't just quantized — there are also hard limits placed on the minimum 
and maximum allowable ratings. (Why should we suppose that a movie given a rating of 5 could not 
have a true underlying rating of 6 or 7 or 10?) The inadequacy of standard matrix completion techniques 
in dealing with this effect is particularly pronounced when we consider recommender systems where each 
rating consists of a single bit representing a positive or negative rating (consider for example rating music 
on Pandora, the relevance of advertisements on Hulu, or posts on Reddit or MathOverflow). In such a
case, the assumptions made in the existing theory of matrix completion do not apply, standard algorithms 
are ill-posed, and a new theory is required. 

1.2 1-Bit compressed sensing and sparse logistic regression 

As noted above, matrix completion is closely related to the field of compressed sensing, where a theory to 
deal with single-bit quantization has recently been developed [5, 22, 37, 38, 21, 30]. In compressed sensing, 
one can recover an $s$-sparse vector in $\mathbb{R}^d$ from $O(s \log(d/s))$ random linear measurements; several different
random measurement structures are compatible with this theory. In 1-bit compressed sensing, only the 
signs of these measurements are observed, but an s-sparse signal can still be approximately recovered from 
the same number of measurements [22, 37, 38, 1]. However, the only measurement ensembles which are 
currently known to give such guarantees are Gaussian or sub-Gaussian [1], and thus of a quite different 
flavor than the kinds of samples obtained in the matrix completion setting. A similar theory is available 
for the closely related problem of sparse binomial regression, which considers more classical statistical 
models [2, 6, 38, 23, 33, 35, 40, 47] and allows non-Gaussian measurements. Our aim here is to develop 
results for matrix completion of the same flavor as 1-bit compressed sensing and sparse logistic regression. 

1.3 Challenges 

In this paper, we extend the theory of matrix completion to the case of 1-bit observations. We consider a 
general observation model but focus mainly on two particular possibilities: the models of logistic and probit 
regression. We discuss these models in greater detail in Section 2.1, but first we note that several new 
challenges arise when trying to leverage results in 1-bit compressed sensing and sparse logistic regression to 
develop a theory for 1-bit matrix completion. First, matrix completion is in some sense a more challenging 
problem than compressed sensing. Specifically, some additional difficulty arises because the set of low-rank 
matrices is "coherent" with single entry measurements (see [19]). In particular, the sampling operator 
does not act as a near-isometry on all matrices of interest, and thus the natural analogue to the restricted 
isometry property from compressed sensing cannot hold in general — there will always be certain low-rank 
matrices that we cannot hope to recover without essentially sampling every entry of the matrix. For 
example, consider a matrix that consists of a single nonzero entry (which we might never observe). The
typical way to deal with this possibility is to consider a reduced set of low-rank matrices by placing 
restrictions on the entry-wise maximum of the matrix or its singular vectors — informally, we require that 
the matrix is not too "spiky".

We introduce an entirely new dimension of ill-posedness by restricting ourselves to 1-bit observations. 
To illustrate this, we describe one version of 1-bit matrix completion in more detail (the general problem 
definition is given in Section 2.1 below). Consider a $d \times d$ matrix $M$ with rank $r$. Suppose we observe a
subset $\Omega$ of entries of a matrix $Y$. The entries of $Y$ depend on $M$ in the following way:

$$Y_{i,j} = \begin{cases} +1 & \text{if } M_{i,j} + Z_{i,j} \ge 0, \\ -1 & \text{if } M_{i,j} + Z_{i,j} < 0, \end{cases} \qquad (1)$$

where $Z$ is a matrix containing noise. This latent variable model is the direct analogue to the usual
1-bit compressed sensing observation model. In this setting, we view the matrix $M$ as more than just a
parameter of the distribution of Y; M represents the real underlying quantity of interest that we would 
like to estimate. Unfortunately, in what would seem to be the most benign setting (when $\Omega$, the set of
observed entries, contains every entry, $Z = 0$, and $M$ has rank 1 and a bounded entry-wise maximum) the problem of recovering
$M$ is ill-posed. To see this, let $M = uv^*$ for any vectors $u, v \in \mathbb{R}^d$, and for simplicity assume that there
are no zero entries in $u$ or $v$. Now let $\tilde{u}$ and $\tilde{v}$ be any vectors with the same sign pattern as $u$ and $v$
respectively. It is apparent that either $M$ or $\tilde{M} = \tilde{u}\tilde{v}^*$ will yield the same observations $Y$, and thus $M$
and $\tilde{M}$ are indistinguishable. Note that while it is obvious that this 1-bit measurement process will destroy
any information we have regarding the scaling of M, this ill-posedness remains even if we knew something 
about the scaling a priori (such as the Frobenius norm of M). For any given set of observations, there will 
always be radically different possible matrices that are all consistent with observed measurements. 
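
The sign-pattern argument above can be checked numerically. The following is a minimal sketch in plain Python; the helper name `sign_matrix`, the dimension, and the rescaling range are illustrative assumptions rather than anything specified in the paper. It builds a rank-1 matrix $uv^*$, rescales each entry of $u$ and $v$ by an arbitrary positive factor, and confirms that the noiseless 1-bit observations coincide.

```python
import random

def sign_matrix(u, v):
    """Noiseless 1-bit observations of the rank-1 matrix u v^T (entries +1 or -1)."""
    return [[1 if ui * vj >= 0 else -1 for vj in v] for ui in u]

rng = random.Random(0)
d = 5  # illustrative dimension

# Random vectors with no zero entries.
u = [rng.choice([-1, 1]) * rng.uniform(0.1, 1.0) for _ in range(d)]
v = [rng.choice([-1, 1]) * rng.uniform(0.1, 1.0) for _ in range(d)]

# Rescale every entry by its own positive factor: same signs, different magnitudes.
u_tilde = [ui * rng.uniform(0.5, 20.0) for ui in u]
v_tilde = [vj * rng.uniform(0.5, 20.0) for vj in v]

Y = sign_matrix(u, v)
Y_tilde = sign_matrix(u_tilde, v_tilde)
print("observations identical:", Y == Y_tilde)
```

The two underlying matrices can differ by large factors entry by entry, yet no algorithm can tell them apart from `Y` alone.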

After considering this example, the problem might seem hopeless. However, an interesting surprise is 
that when we add noise to the problem (that is, when $Z \neq 0$ is an appropriate stochastic matrix) the
picture completely changes — this noise has a "dithering" effect and the problem becomes well-posed. In 
fact, we will show that in this setting we can sometimes recover M to the same degree of accuracy that 
is possible when given access to completely unquantized measurements! In particular, under appropriate 
conditions, $O(rd)$ measurements are sufficient to accurately recover $M$.
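
To see informally why noise helps, consider a single entry under logistic noise (the logistic model is introduced in Section 2.1). The sketch below is only an illustration; the helper names `f_logistic` and `logit` are assumptions made here. Without noise, the entries 0.5 and 2.0 both produce the bit +1, whereas with noise the probability of observing +1 differs between them and can be inverted to recover the magnitude.

```python
import math

def f_logistic(x):
    """Probability of observing +1 when the entry is x and the noise is standard logistic."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of f_logistic: maps a probability back to the underlying entry."""
    return math.log(p / (1.0 - p))

entries = [0.5, 2.0]

# Without noise both entries give the same bit, so their magnitudes are lost.
noiseless_bits = [1 if x >= 0 else -1 for x in entries]

# With logistic noise the probability of +1 depends on the magnitude ...
plus_probabilities = [f_logistic(x) for x in entries]

# ... and that probability determines the entry.
recovered = [logit(p) for p in plus_probabilities]

print(noiseless_bits, plus_probabilities, recovered)
```

In practice the probabilities are not observed directly and must be inferred jointly across entries, which is what the low-rank structure makes possible.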

1.4 Applications 

The problem of 1-bit matrix completion arises in nearly every application that has been proposed for 
"unquantized" matrix completion. To name a few: 

• Recommender systems: As mentioned above, collaborative filtering systems often involve discretized
recommendations [17]. In many cases, each observation will consist simply of a "thumbs up"
or "thumbs down" thus delivering only 1 bit of information (consider for example rating music on
Pandora, the relevance of advertisements on Hulu, or posts on Reddit or MathOverflow). Such cases
are a natural application for 1-bit matrix completion.

• Analysis of survey data: Another potential application for matrix completion is to analyze incomplete
survey data. Such data is almost always heavily quantized since people are generally not able
to distinguish between more than 7 ± 2 categories [34]. 1-bit matrix completion provides a method
for analyzing incomplete (or potentially even complete) survey designs containing simple yes/no or 
agree/disagree questions. 

• Distance matrix recovery and multidimensional scaling: Yet another common motivation 
for matrix completion is to localize nodes in a sensor network from the observation of just a few 
inter-node distances [3, 44, 45]. This is essentially a special case of multidimensional scaling (MDS) 
from incomplete data [4]. In general, work in the area assumes real-valued measurements. However,
in the sensor network example (as well as many other MDS scenarios), the measurements may be
very coarse and might only indicate whether the nodes are within or outside of some communication 
range. While there is some existing work on MDS using binary data [18] and MDS using incomplete 
observations with other kinds of non-metric data [46], 1-bit matrix completion promises to provide a 
principled and unifying approach to such problems. 

• Quantum state tomography: Low-rank matrix recovery from incomplete observations also has 
applications to quantum state tomography [20]. In this scenario, mixed quantum states are represented
as Hermitian matrices with nuclear norm equal to 1. When the state is nearly pure, the
matrix can be well approximated by a low-rank matrix and, in particular, fits the model given in
Section 2.2 up to a rescaling. Furthermore, Pauli-operator-based measurements give probabilistic
binary outputs. However, these are based on the inner products with the Pauli matrices, and thus
of a slightly different flavor than the measurements considered in this paper. Nevertheless, while we 
do not address this scenario directly, our theory of 1-bit matrix completion could easily be adapted 
to quantum state tomography. 

1.5 Notation 

We now provide a brief summary of some of the key notation used in this paper. We use [d] to denote the 
set of integers $\{1, \ldots, d\}$. We use capital boldface to denote a matrix (e.g., $M$) and standard text to denote
its entries (e.g., $M_{i,j}$). Similarly, we let $0$ denote the matrix of all-zeros and $1$ the matrix of all-ones. We
let $\|M\|$ denote the operator norm of $M$, $\|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2}$ denote the Frobenius norm of $M$, $\|M\|_*$
denote the nuclear or Schatten-1 norm of $M$ (the sum of the singular values), and $\|M\|_\infty = \max_{i,j} |M_{i,j}|$
denote the entry-wise infinity-norm of $M$. We will use the Hellinger distance, which, for two scalars
$p, q \in [0, 1]$, is given by

$$d_H^2(p, q) := (\sqrt{p} - \sqrt{q})^2 + (\sqrt{1 - p} - \sqrt{1 - q})^2.$$

This gives a standard notion of distance between two binary probability distributions. We also allow the
Hellinger distance to act on matrices via the average Hellinger distance over their entries: for matrices
$P, Q \in [0, 1]^{d_1 \times d_2}$, we define

$$d_H^2(P, Q) := \frac{1}{d_1 d_2} \sum_{i,j} d_H^2(P_{i,j}, Q_{i,j}).$$

Finally, for an event $\mathcal{E}$, $1_{[\mathcal{E}]}$ is the indicator function for that event, i.e., $1_{[\mathcal{E}]}$ is 1 if $\mathcal{E}$ occurs and 0 otherwise.
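
The two Hellinger distances defined above translate directly into code. The following is a small sketch in plain Python; the function names and the list-of-lists matrix representation are choices made here for illustration.

```python
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q), for p, q in [0, 1]."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1.0 - p) - math.sqrt(1.0 - q)) ** 2

def hellinger_sq_matrix(P, Q):
    """Average of the entry-wise squared Hellinger distances between two d1 x d2 matrices."""
    d1, d2 = len(P), len(P[0])
    total = sum(hellinger_sq(P[i][j], Q[i][j]) for i in range(d1) for j in range(d2))
    return total / (d1 * d2)

P = [[0.2, 0.9], [0.5, 0.5]]
Q = [[0.3, 0.8], [0.5, 0.1]]
print(hellinger_sq_matrix(P, Q))
```

The scalar distance is zero exactly when the two probabilities agree and reaches its maximum value 2 when one distribution puts all its mass on +1 and the other on -1.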

1.6 Organization of the paper 

We proceed in Section 2 by describing the 1-bit matrix completion problem in greater detail. In Section 3 we 
state our main results. Specifically, we propose a pair of convex programs for the 1-bit matrix completion 
problem and establish upper bounds on the accuracy with which these can recover the matrix M and 
the distribution of the observations Y. We also establish lower bounds, showing that our upper bounds 
are nearly optimal. The proofs of these results are given in Section 4. Section 5 concludes with a brief 
discussion. 



2 The 1-bit matrix completion problem 

2.1 Observation model 

We now introduce the more general observation model that we study in this paper. Given a matrix
$M \in \mathbb{R}^{d_1 \times d_2}$, a subset of indices $\Omega \subset [d_1] \times [d_2]$, and a differentiable function $f : \mathbb{R} \to [0, 1]$, we observe

$$Y_{i,j} = \begin{cases} +1 & \text{with probability } f(M_{i,j}), \\ -1 & \text{with probability } 1 - f(M_{i,j}), \end{cases} \qquad \text{for } (i,j) \in \Omega. \qquad (2)$$

We will leave $f$ general for now and discuss a few common choices just below. As has been important in
previous work on matrix completion, we assume that $\Omega$ is chosen at random with $\mathbb{E}|\Omega| = m$. Specifically,
we assume that $\Omega$ follows a binomial model in which each entry $(i, j) \in [d_1] \times [d_2]$ is included in $\Omega$ with
probability $\frac{m}{d_1 d_2}$, independently.
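
The observation model (2) together with the binomial sampling of $\Omega$ can be simulated in a few lines. This is a sketch under stated assumptions: the function name `sample_observations`, the dictionary representation of the observed entries, and the use of the logistic link as the example $f$ are all choices made here, not prescribed by the paper.

```python
import math
import random

def logistic(x):
    """Example link function f (the logistic model discussed in the text)."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_observations(M, m, f, rng):
    """Draw Omega under the binomial model and Y under model (2).

    Each entry (i, j) is observed independently with probability m / (d1 * d2);
    an observed entry equals +1 with probability f(M[i][j]) and -1 otherwise.
    Returns a dict mapping each observed index (i, j) to its bit.
    """
    d1, d2 = len(M), len(M[0])
    p_obs = m / (d1 * d2)
    Y = {}
    for i in range(d1):
        for j in range(d2):
            if rng.random() < p_obs:
                Y[(i, j)] = 1 if rng.random() < f(M[i][j]) else -1
    return Y

rng = random.Random(0)
M = [[0.5 * (i - j) for j in range(6)] for i in range(4)]  # illustrative 4 x 6 matrix
Y = sample_observations(M, m=12, f=logistic, rng=rng)
print(len(Y), "entries observed out of", 4 * 6)
```

Note that the number of observed entries is random with expectation $m$, rather than exactly $m$.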

Before discussing some particular choices for $f$, we first note that while the observation model described
in (2) may appear on the surface to be somewhat different from the setup in (1), they are actually equivalent
if $f$ behaves like a cumulative distribution function. Specifically, for the model in (1), if $Z$ has i.i.d. entries,
then by setting $f(x) := P(Z_{1,1} \ge -x)$, the model in (1) reduces to that in (2). Similarly, for any choice of
$f(x)$ in (2), if we define $Z$ as having i.i.d. entries drawn from a distribution whose cumulative distribution
function is given by $F_Z(x) = P(z \le x) = 1 - f(-x)$, then (2) reduces to (1). Of course, in any given
situation one of these observation models may seem more or less natural than the other. For example, (1)
may seem more appropriate when $M$ is viewed as a latent variable which we might be interested in
estimating, while (2) may make more sense when $M$ is viewed as just a parameter of a distribution.
Ultimately, however, the two models are equivalent.

We now consider two natural choices for $f$ (or equivalently, for $Z$):

Example 1 (Logistic regression/Logistic noise). The logistic regression model, which is common in statistics,
is captured by (2) with $f(x) = \frac{e^x}{1 + e^x}$ and by (1) with $Z_{i,j}$ i.i.d. according to the standard logistic
distribution.

Example 2 (Probit regression/Gaussian noise). The probit regression model is captured by (2) by setting
$f(x) = 1 - \Phi(-x/\sigma) = \Phi(x/\sigma)$, where $\Phi$ is the cumulative distribution function of a standard Gaussian,
and by (1) with $Z_{i,j}$ i.i.d. according to a mean-zero Gaussian distribution with variance $\sigma^2$.
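
The correspondence between the noise distribution in (1) and the function $f$ in (2) can be verified numerically for both examples. In the sketch below, the helper `f_from_noise_cdf` and the test points are illustrative assumptions; it implements the relation $f(x) = P(Z \ge -x) = 1 - F_Z(-x)$ for continuous noise and compares the result with the closed forms of Examples 1 and 2.

```python
import math

def std_normal_cdf(t):
    """Cumulative distribution function of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def f_logistic(x):
    """Logistic link of Example 1."""
    return math.exp(x) / (1.0 + math.exp(x))

def f_probit(x, sigma):
    """Probit link of Example 2 with noise standard deviation sigma."""
    return std_normal_cdf(x / sigma)

def f_from_noise_cdf(noise_cdf):
    """Link induced by the latent model (1): f(x) = P(Z >= -x) = 1 - F_Z(-x)."""
    return lambda x: 1.0 - noise_cdf(-x)

sigma = 2.0  # illustrative noise level
logistic_noise_cdf = lambda z: 1.0 / (1.0 + math.exp(-z))
gaussian_noise_cdf = lambda z: std_normal_cdf(z / sigma)

f1 = f_from_noise_cdf(logistic_noise_cdf)
f2 = f_from_noise_cdf(gaussian_noise_cdf)

for x in (-3.0, -0.5, 0.0, 1.0, 2.5):
    print(x, f1(x) - f_logistic(x), f2(x) - f_probit(x, sigma))
```

The printed differences are zero up to floating-point rounding.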

2.2 Approximately low-rank matrices 

The majority of the literature on matrix completion assumes that the first r singular values of M are 
nonzero and the remainder are exactly zero. However, in many applications the singular values instead 
exhibit only a gradual decay towards zero. Thus, in this paper we allow a relaxation of the assumption 
that $M$ has rank exactly $r$. Instead, we assume that $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$, where $\alpha$ is a parameter left to be
determined, but which will often be of constant order. In other words, the singular values of $M$ belong to
a scaled $\ell_1$ ball. In compressed sensing, belonging to an $\ell_p$ ball with $p \in (0, 1]$ is a common relaxation of
exact sparsity; in matrix completion, the nuclear-norm ball (or Schatten-1 ball) plays an analogous role.

The particular choice of scaling, $\alpha\sqrt{r d_1 d_2}$, arises from the following considerations. Suppose that each
entry of $M$ is bounded in magnitude by $\alpha$ and that $\mathrm{rank}(M) \le r$. Then

$$\|M\|_* \le \sqrt{r}\,\|M\|_F \le \sqrt{r d_1 d_2}\,\|M\|_\infty \le \alpha\sqrt{r d_1 d_2}.$$

Thus, the assumption that $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$ is a relaxation of the conditions that $\mathrm{rank}(M) \le r$ and
$\|M\|_\infty \le \alpha$. The condition that $\|M\|_\infty \le \alpha$ essentially means that the probability of seeing a $+1$ or
$-1$ does not depend on the dimension. It is also a way of enforcing that $M$ should not be too "spiky";
as discussed above this is an important assumption in order to make the recovery of M well-posed (e.g., 
see [36]). 
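
For completeness, the chain of norm inequalities used in this section can be unpacked step by step. The notation $s_1, \ldots, s_r$ for the (at most $r$) nonzero singular values of $M$ is introduced here only for this derivation.

```latex
\begin{align*}
\|M\|_* = \sum_{k=1}^{r} s_k
  &\le \sqrt{r}\,\Big(\sum_{k=1}^{r} s_k^2\Big)^{1/2} = \sqrt{r}\,\|M\|_F
  && \text{(Cauchy--Schwarz over at most $r$ nonzero singular values)} \\
\|M\|_F = \Big(\sum_{i,j} M_{i,j}^2\Big)^{1/2}
  &\le \sqrt{d_1 d_2}\,\max_{i,j} |M_{i,j}| = \sqrt{d_1 d_2}\,\|M\|_\infty
  && \text{(there are $d_1 d_2$ entries)} \\
\|M\|_* &\le \sqrt{r d_1 d_2}\,\|M\|_\infty \le \alpha\sqrt{r d_1 d_2}
  && \text{(combining, and using $\|M\|_\infty \le \alpha$)}
\end{align*}
```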

3 Main results 

We now state our main results. We will have two goals: the first is to accurately recover $M$ itself, and
the second is to accurately recover the distribution of $Y$ given by $f(M)$.¹

¹ Strictly speaking, $f(M) \in [0, 1]^{d_1 \times d_2}$ is simply a matrix of scalars, but these scalars implicitly define the distribution of
$Y$, so we will sometimes abuse notation slightly and refer to $f(M)$ as the distribution of $Y$.

3.1 Convex programming

In order to approximate either $M$ or $f(M)$, we will maximize the log-likelihood function of the optimization
variable $X$ given our observations subject to a set of convex constraints. In our case, the log-likelihood
function is given by

$$F_{\Omega,Y}(X) := \sum_{(i,j) \in \Omega} \left( 1_{[Y_{i,j} = 1]} \log(f(X_{i,j})) + 1_{[Y_{i,j} = -1]} \log(1 - f(X_{i,j})) \right).$$

To recover $M$, we will use the solution to the following program:

$$\hat{M} = \arg\max_X F_{\Omega,Y}(X) \quad \text{subject to} \quad \|X\|_* \le \alpha\sqrt{r d_1 d_2} \quad \text{and} \quad \|X\|_\infty \le \alpha. \qquad (3)$$

To recover the distribution $f(M)$, we need not enforce the infinity-norm constraint, and will use the
following simpler program:

$$\hat{M} = \arg\max_X F_{\Omega,Y}(X) \quad \text{subject to} \quad \|X\|_* \le \alpha\sqrt{r d_1 d_2}. \qquad (4)$$

In many cases, $F_{\Omega,Y}(X)$ is a concave function and thus the above programs are convex. This can be
easily checked in the case of the logistic model and can also be verified in the case of the probit model
(e.g., see [48]).
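
As an illustration of the objective only (not of a solver for the constrained programs), the log-likelihood can be evaluated as follows. This is a sketch: the function names, the dictionary representation of the observations, and the example matrices are assumptions made here, and the constraints of (3) and (4) are not enforced.

```python
import math

def logistic(x):
    """Logistic link f(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(X, Y, f):
    """Log-likelihood of X given observations Y, a dict mapping (i, j) in Omega to +1 or -1."""
    total = 0.0
    for (i, j), y in Y.items():
        p = f(X[i][j])
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Three observed entries of a 2 x 2 matrix (illustrative data).
Y = {(0, 0): 1, (0, 1): -1, (1, 1): 1}
A = [[1.0, -2.0], [0.5, 3.0]]
B = [[-1.0, 0.3], [2.0, -0.7]]
midpoint = [[0.5 * (A[i][j] + B[i][j]) for j in range(2)] for i in range(2)]

# Concavity along the segment from A to B: the midpoint value is at least the average.
gap = log_likelihood(midpoint, Y, logistic) - 0.5 * (
    log_likelihood(A, Y, logistic) + log_likelihood(B, Y, logistic)
)
print("concavity gap at the midpoint:", gap)
```

A nonnegative gap is what concavity of the logistic log-likelihood predicts; a full implementation would additionally need a way to handle the nuclear-norm and infinity-norm constraints.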

3.2 Recovery of the matrix 

We now state our main result concerning the recovery of the matrix M. As discussed in Section 1.3 we 
place a "non-spikiness" condition on M to make recovery possible; we enforce this with an infinity-norm 
constraint. Further, some assumptions must be made on $f$ for recovery of $M$ to be feasible. We define
two quantities $L_\alpha$ and $\beta_\alpha$ which control the "steepness" and "flatness" of $f$, respectively:

$$L_\alpha := \sup_{|x| \le \alpha} \frac{|f'(x)|}{f(x)(1 - f(x))} \quad \text{and} \quad \beta_\alpha := \sup_{|x| \le \alpha} \frac{f(x)(1 - f(x))}{(f'(x))^2}. \qquad (5)$$

In this paper we will restrict our attention to $f$ such that $L_\alpha$ and $\beta_\alpha$ are well-defined. In particular, we
assume that $f$ and $f'$ are non-zero in $[-\alpha, \alpha]$. This assumption is fairly mild; for example, it includes the
logistic and probit models (as we will see below in Remark 1). The quantity $L_\alpha$ appears only in our upper
bounds, but it is generally well behaved. The quantity $\beta_\alpha$ appears both in our upper and lower bounds.
Intuitively, it controls the "flatness" of $f$ in the interval $[-\alpha, \alpha]$: the flatter $f$ is, the larger $\beta_\alpha$ is. It is
clear that some dependence on $\beta_\alpha$ is necessary. Indeed, if $f$ is perfectly flat, then the magnitudes of the
entries of $M$ cannot be recovered, as seen in the noiseless case discussed in Section 1.3. Of course, when $\alpha$
is a fixed constant and $f$ is a fixed function, both $L_\alpha$ and $\beta_\alpha$ are bounded by fixed constants independent
of the dimension.
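
The quantities in (5) are easy to approximate numerically for a given $f$. The sketch below replaces each supremum by a maximum over a uniform grid on $[-\alpha, \alpha]$, which is an approximation chosen here for simplicity; the function names, grid size, and the values $\alpha = 2$, $\sigma = 1$ are illustrative assumptions. The results are compared against the closed forms and bounds stated in Remark 1.

```python
import math

def steepness_and_flatness(f, f_prime, alpha, grid=2001):
    """Approximate L_alpha and beta_alpha from (5) by a maximum over a uniform grid."""
    L, beta = 0.0, 0.0
    for k in range(grid):
        x = -alpha + 2.0 * alpha * k / (grid - 1)
        p, dp = f(x), f_prime(x)
        L = max(L, abs(dp) / (p * (1.0 - p)))
        beta = max(beta, p * (1.0 - p) / dp ** 2)
    return L, beta

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_prime(x):
    p = logistic(x)
    return p * (1.0 - p)

def probit(x, sigma=1.0):
    # Phi(x / sigma), written with erfc for accuracy in the lower tail.
    return 0.5 * math.erfc(-x / (sigma * math.sqrt(2.0)))

def probit_prime(x, sigma=1.0):
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

alpha = 2.0
L_log, beta_log = steepness_and_flatness(logistic, logistic_prime, alpha)
L_pro, beta_pro = steepness_and_flatness(probit, probit_prime, alpha)

print("logistic:", L_log, beta_log, "closed form:", (1 + math.exp(alpha)) ** 2 / math.exp(alpha))
print("probit:  ", L_pro, beta_pro)
```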

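Theorem 1. Assume that $\|M\|_* \le \alpha\sqrt{d_1 d_2 r}$ and $\|M\|_\infty \le \alpha$. Suppose that $\Omega$ is chosen at random
following the binomial model of Section 2.1 with $\mathbb{E}|\Omega| = m$. Suppose that $Y$ is generated as in (2). Let
$L_\alpha$ and $\beta_\alpha$ be as in (5). Consider the solution $\hat{M}$ to (3). Then with probability at least $1 - C_1/(d_1 + d_2)$,

$$\frac{1}{d_1 d_2} \|\hat{M} - M\|_F^2 \le C_\alpha \sqrt{\frac{r(d_1 + d_2)}{m}} \sqrt{1 + \frac{(d_1 + d_2)\log(d_1 d_2)}{m}}$$

with $C_\alpha := C_2 \alpha L_\alpha \beta_\alpha$. If $m \ge (d_1 + d_2)\log(d_1 d_2)$ then this simplifies to

$$\frac{1}{d_1 d_2} \|\hat{M} - M\|_F^2 \le \sqrt{2}\, C_\alpha \sqrt{\frac{r(d_1 + d_2)}{m}}. \qquad (6)$$

Above, $C_1$ and $C_2$ are absolute constants.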
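
To get a feel for how the bound of Theorem 1 scales, one can evaluate its dimension-dependent factor for concrete sizes. The sketch below omits the constant $C_\alpha$ entirely (its value is not specified here), so the numbers are meaningful only up to that factor; the function names and the example dimensions are illustrative assumptions.

```python
import math

def theorem1_rate(r, d1, d2, m):
    """Dimension-dependent factor of the Theorem 1 bound, with C_alpha omitted."""
    d = d1 + d2
    return math.sqrt(r * d / m) * math.sqrt(1.0 + d * math.log(d1 * d2) / m)

def simplified_rate(r, d1, d2, m):
    """Factor appearing in the simplified bound (6), with C_alpha omitted."""
    return math.sqrt(2.0) * math.sqrt(r * (d1 + d2) / m)

r, d1, d2 = 5, 1000, 1000
threshold = (d1 + d2) * math.log(d1 * d2)  # sample size above which (6) applies

for m in (50_000, 200_000, 800_000):
    print(m, m >= threshold, theorem1_rate(r, d1, d2, m), simplified_rate(r, d1, d2, m))
```

Once $m$ exceeds the threshold, the first factor is dominated by the second, and quadrupling $m$ halves the simplified factor.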
Remark 1 (Recovery in the logistic and probit models). The logistic model satisfies the hypotheses of
Theorem 1 with $\beta_\alpha = \frac{(1 + e^\alpha)^2}{e^\alpha} \approx e^\alpha$ and $L_\alpha = 1$. The probit model has

$$\beta_\alpha \le c_1 \sigma^2 e^{\frac{\alpha^2}{2\sigma^2}} \quad \text{and} \quad L_\alpha \le c_2 \frac{\frac{\alpha}{\sigma} + 1}{\sigma},$$

where we can take $c_1 = \pi$ and $c_2 = 8$. In particular, in the probit model the bound in (6) reduces to

$$\frac{1}{d_1 d_2} \|\hat{M} - M\|_F^2 \le C \left(\frac{\alpha}{\sigma} + 1\right) \exp\left(\frac{\alpha^2}{2\sigma^2}\right) \sigma\alpha \sqrt{\frac{r(d_1 + d_2)}{m}}. \qquad (7)$$

Hence, when $\sigma < \alpha$, increasing the size of the noise leads to significantly improved error bounds; this is
not an artifact of the proof. We will see in Section 3.4 that the exponential dependence on $\alpha$ in the logistic
model (and on $\alpha/\sigma$ in the probit model) is intrinsic to the problem. Intuitively we should expect this
since for such models, as $\|M\|_\infty$ grows large, we essentially revert to the noiseless setting where estimation
of $M$ is impossible. Furthermore, in Section 3.4 we will also see that when $\alpha$ (or $\alpha/\sigma$) is bounded by a
constant, the error bound (6) is optimal up to a constant factor. Fortunately, in many applications, one
would expect $\alpha$ to be small, and in particular to have little, if any, dependence on the dimension. This
ensures that each measurement will always have a non-vanishing probability of returning $1$ as well as a
non-vanishing probability of returning $-1$.

Finally, note that if $M$ is exactly rank $r$ and satisfies $\|M\|_\infty \le \alpha$, then as discussed in Section 2.2, $M$
will automatically satisfy the assumptions of Theorem 1. Furthermore, note that the theorem also holds
if $\Omega = [d_1] \times [d_2]$, i.e., if we sample each entry exactly once or observe a complete realization of $Y$. Even
in this context, the ability to recover M is somewhat surprising. 



3.3 Recovery of the distribution 

In many situations, we might not be interested in the underlying matrix M, but rather in determining 
the distribution of the unknown entries of Y. For example, in recommender systems, a natural question 
would be to determine the likelihood that a user would enjoy a particular unrated item. 

Surprisingly, this distribution may be accurately recovered without any restriction on the infinity-norm 
of M. This may be unexpected to those familiar with the matrix completion literature in which "non- 
spikiness" constraints seem to be unavoidable. In fact, we will show in Section 3.4 that the bound in 
Theorem 2 is near-optimal; further, we will show that even under the added constraint that $\|M\|_\infty \le \alpha$,
it would be impossible to estimate f(M) significantly more accurately. 

Theorem 2. Assume that $\|M\|_* \le \alpha\sqrt{d_1 d_2 r}$. Suppose that $\Omega$ is chosen at random following the binomial
model of Section 2.1 with $\mathbb{E}|\Omega| = m$. Suppose that $Y$ is generated as in (2), and let $L = \lim_{\alpha \to \infty} L_\alpha$. Let
$\hat{M}$ be the solution to (4). Then, with probability at least $1 - C_1/(d_1 + d_2)$,

$$d_H^2(f(\hat{M}), f(M)) \le C_2 \alpha L \sqrt{\frac{r(d_1 + d_2)}{m}} \sqrt{1 + \frac{(d_1 + d_2)\log(d_1 d_2)}{m}}. \qquad (8)$$

Furthermore, as long as $m \ge (d_1 + d_2)\log(d_1 d_2)$, we have

$$d_H^2(f(\hat{M}), f(M)) \le \sqrt{2}\, C_2 \alpha L \sqrt{\frac{r(d_1 + d_2)}{m}}. \qquad (9)$$

Above, $C_1$ and $C_2$ are absolute constants.

While $L = 1$ for the logistic model, the astute reader will have noticed that for the probit model $L$
is unbounded; that is, $L_\alpha$ tends to $\infty$ as $\alpha \to \infty$. $L$ would also be unbounded for the case where $f(x)$
takes values of 1 or 0 outside of some range (as would be the case in (1) if the distribution of the noise had
compact support). Fortunately, however, we can recover a result for these cases by enforcing an infinity-norm
constraint, as described in Theorem 6 below. Moreover, for a large class of functions $f$, $L$ is indeed
bounded. For example, in the latent variable version of (1) if the entries $Z_{i,j}$ are at least as fat-tailed as
an exponential random variable, then $L$ is bounded. To be more precise, suppose that $f$ is continuously
differentiable and for simplicity assume that the distribution of $Z_{i,j}$ is symmetric and $|f'(x)|/(1 - f(x))$
is monotonic for $x$ sufficiently large. If $P(|Z_{i,j}| \ge t) \ge C\exp(-ct)$ for all $t \ge 0$, then one can show that $L$
is finite. This property is also essentially equivalent to the requirement that a distribution have bounded 
hazard rate. As noted above, this property holds for the logistic distribution, but also for many other 
common distributions, including the Laplacian, student's t, Cauchy, and others. 
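
As a concrete instance of a bounded $L$, consider Laplacian noise. The worked computation below assumes the standard Laplace density $\tfrac{1}{2}e^{-|z|}$ (the unit scale is a choice made here for illustration) and treats $x \ge 0$; the case $x < 0$ follows by symmetry with $x$ replaced by $|x|$.

```latex
% Assumption: Z has the standard Laplace density (1/2) e^{-|z|}, and x >= 0.
\begin{align*}
f(x) &= P(Z \ge -x) = 1 - \tfrac{1}{2} e^{-x}, \qquad
f'(x) = \tfrac{1}{2} e^{-x}, \qquad
1 - f(x) = \tfrac{1}{2} e^{-x}, \\
\frac{|f'(x)|}{f(x)\,(1 - f(x))} &= \frac{1}{f(x)} = \frac{1}{1 - \tfrac{1}{2} e^{-x}} \le 2 .
\end{align*}
```

The ratio is largest at $x = 0$, where it equals 2, and decreases towards 1 as $x$ grows, so in this example $L = 2$.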



3.4 Room for improvement? 

We now discuss the extent to which Theorems 1 and 2 are optimal. We give three theorems, all proved 
using information theoretic methods, which show that these results are nearly tight, even when some of 
our assumptions are relaxed. Theorem 3 gives a lower bound to nearly match the upper bound on the 
error in recovering M derived in Theorem 1. Theorem 4 compares our upper bounds to those available 
without discretization and shows that very little is lost when discretizing to a single bit. Finally, Theorem 
5 gives a lower bound matching, up to a constant factor, the upper bound on the error in recovering the 
distribution f(M) given in Theorem 2. Theorem 5 also shows that Theorem 2 does not suffer by dropping 
the canonical "spikiness" constraint. 

Our lower bounds require a few assumptions, so before we delve into the bounds themselves, we briefly 
argue that these assumptions are rather innocuous. First, without loss of generality (since we can always
adjust $f$ to account for rescaling $M$), we assume that $\alpha \ge 1$. Next, we require that the parameters be
sufficiently large so that

$$\alpha^2 r \max\{d_1, d_2\} \ge C_0 \qquad (10)$$

for an absolute constant $C_0$. Note that we could replace this with a simpler, but still mild, condition that
$d_1 > C_0$. Finally, we also require that $r \ge c$ where $c$ is either 1 or 4 and that $r \le O(\min\{d_1, d_2\}/\alpha^2)$,
where $O(\cdot)$ hides parameters (which may differ in each Theorem) that we make explicit below. This last
assumption simply means that we are in the situation where $r$ is significantly smaller than $d_1$ and $d_2$, i.e.,
the matrix is of approximately low rank.
In the following, let

$$K = \left\{ M : \|M\|_* \le \alpha\sqrt{r d_1 d_2},\ \|M\|_\infty \le \alpha \right\} \qquad (11)$$

denote the set of matrices whose recovery is guaranteed by Theorem 1.



3.4.1 Recovery from 1-bit measurements 

Theorem 3. Fix $\alpha$, $r$, $d_1$, and $d_2$ to be such that $r \ge 4$ and (10) holds. Let $\beta_\alpha$ be defined as in (5), and
suppose that $f'(x)$ is decreasing for $x > 0$. Let $\Omega$ be any subset of $[d_1] \times [d_2]$ with cardinality $m$, and let $Y$
be as in (2). Consider any algorithm which, for any $M \in K$, takes as input $Y_{i,j}$ for $(i,j) \in \Omega$ and returns
$\hat{M}$. Then, there exists $M \in K$ such that with probability at least 3/4,

$$\frac{1}{d_1 d_2} \|M - \hat{M}\|_F^2 \ge \min\left\{ C_1,\ C_2 \alpha \sqrt{\beta_{3\alpha/4}} \sqrt{\frac{r \max\{d_1, d_2\}}{m}} \right\} \qquad (12)$$

as long as the right-hand side of (12) exceeds $r\alpha^2/\min(d_1, d_2)$. Above, $C_1$ and $C_2$ are absolute constants.²

² Here and in the theorems below, the choice of 3/4 in the probability bound is arbitrary, and can be adjusted at the cost
of changing $C_0$ in (10) and $C_1$ and $C_2$. Similarly, $\beta_{3\alpha/4}$ can be replaced by $\beta_{(1-\epsilon)\alpha}$ for any $\epsilon > 0$.
The requirement that the right-hand side of (12) be larger than $r\alpha^2/\min(d_1, d_2)$ is satisfied as long as
$r \le O(\min\{d_1, d_2\}/\alpha^2)$. In particular, it is satisfied whenever

$$r \le C_3 \frac{\min(1, \beta_0) \cdot \min(d_1, d_2)}{\alpha^2}$$

for a fixed constant $C_3$. Note also that in the latent variable model in (1), $f'(x)$ is simply the probability
density of $Z_{i,j}$. Thus, the requirement that $f'(x)$ be decreasing is simply asking the probability density to
have decreasing tails. One can easily check that this is satisfied for the logistic and probit models.

Note that if $\alpha$ is bounded by a constant and $f$ is fixed (in which case $\beta_\alpha$ and $\beta_{3\alpha/4}$ are bounded by a
constant), then the lower bound of Theorem 3 matches the upper bound given in (6) up to a constant.
When $\alpha$ is not treated as a constant, the bounds differ by a factor of $\sqrt{\beta_\alpha}$. In the logistic model $\beta_\alpha \approx e^\alpha$
and so this amounts to the difference between $e^{\alpha/2}$ and $e^\alpha$. The probit model has a similar change in the
constant of the exponent.



3.4.2 Recovery from unquantized measurements 

Next we show that, surprisingly, very little is lost by discretizing to a single bit. In Theorem 4, we consider 
an "unquantized" version of the latent variable model in (1) with Gaussian noise. That is, let Z be a 
matrix of i.i.d. Gaussian random variables, and suppose the noisy entries $M_{i,j} + Z_{i,j}$ are observed directly,
without discretization. In this setting, we give a lower bound that still nearly matches the upper bound
given in Theorem 1, up to the $\beta_\alpha$ term.

Theorem 4. Fix $\alpha$, $r$, $d_1$, and $d_2$ to be such that $r \ge 1$ and (10) holds. Let $\Omega$ be any subset of $[d_1] \times [d_2]$
with cardinality $m$, and let $Z$ be a $d_1 \times d_2$ matrix with i.i.d. Gaussian entries with variance $\sigma^2$. Consider
any algorithm which, for any $M \in K$, takes as input $Y_{i,j} = M_{i,j} + Z_{i,j}$ for $(i,j) \in \Omega$ and returns $\hat{M}$.
Then, there exists $M \in K$ such that with probability at least 3/4,

$$\frac{1}{d_1 d_2} \|M - \hat{M}\|_F^2 \ge \min\left\{ C_1,\ C_2 \alpha\sigma \sqrt{\frac{r \max\{d_1, d_2\}}{m}} \right\} \qquad (13)$$

as long as the right-hand side of (13) exceeds $r\alpha^2/\min(d_1, d_2)$. Above, $C_1$ and $C_2$ are absolute constants.
The requirement that the right-hand side of (13) be larger than $r\alpha^2/\min(d_1, d_2)$ is satisfied whenever

$$r \le C_3 \frac{\min(1, \sigma^2) \min(d_1, d_2)}{\alpha^2}$$

for a fixed constant $C_3$.

Following Remark 1, the lower bound given in (13) matches the upper bound proven in Theorem 1 for
the solution to (3) up to a constant, as long as $\alpha/\sigma$ is bounded by a constant. In other words:

When the signal-to-noise ratio is constant, almost nothing is lost by quantizing to a single bit. 

Perhaps it is not particularly surprising that 1-bit quantization induces little loss of information in the 
regime where the noise is comparable to the underlying quantity we wish to estimate — however, what 
is somewhat of a surprise is that the simple convex program in (4) can successfully recover all of the 
information contained in these 1-bit measurements. 

Before proceeding, we also briefly note that our Theorem 4 is somewhat similar to Theorem 3 in [36]. 
The authors in [36] consider slightly different sets K: these sets are more restrictive in the sense that it 
is required that $\alpha \ge \sqrt{32 \log n}$ and less restrictive because the nuclear-norm constraint may be replaced
by a general Schatten-$p$ norm constraint. It was important for us to allow $\alpha = O(1)$ in order to compare
with our upper bounds due to the exponential dependence of $\beta_\alpha$ on $\alpha$ in Theorem 1 for the probit model.
This led to some new challenges in the proof. Finally, it is also noteworthy that our statements hold for
arbitrary sets $\Omega$, while the argument in [36] is only valid for a random choice of $\Omega$.
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3.4.3 Recovery of the distribution from 1-bit measurements 

To conclude we address the optimality of Theorem 2. We show that under mild conditions on /, any 
algorithm that recovers the distribution f(M) must yield an estimate whose Hellinger distance deviates 
from the true distribution by an amount proportional to ayrdi^Jm, matching the upper bound of (9) 
up to a constant. Notice that the lower bound holds even if the algorithm is promised that ||Af ||oo < a, 
which the upper bound did not require. 

Theorem 5. Fix $\alpha$, $r$, $d_1$, and $d_2$ to be such that $r \ge 4$ and (10) holds. Let $L_1$ be defined as in (5), and 
suppose that $f'(x) \ge c$ and $c' \le f(x) \le 1 - c'$ for $x \in [-1,1]$, for some constants $c, c' > 0$. Let $\Omega$ be 
any subset of $[d_1] \times [d_2]$ with cardinality $m$, and let $Y$ be as in (2). Consider any algorithm which, for 
any $M \in K$, takes as input $Y_{ij}$ for $(i,j) \in \Omega$ and returns $\widehat{M}$. Then, there exists $M \in K$ such that with 
probability at least 3/4, 

$$d_H^2(f(M), f(\widehat{M})) \ge \min\left\{ C_1,\; C_2\,\frac{\alpha}{L_1}\sqrt{\frac{r\max\{d_1,d_2\}}{m}} \right\} \tag{14}$$

as long as the right-hand side of (14) exceeds $r\alpha^2/\min(d_1,d_2)$. Above, $C_1$ and $C_2$ are constants that depend 
on $c, c'$. 

The requirement that the right-hand side of (14) be larger than $r\alpha^2/\min(d_1,d_2)$ is satisfied whenever 

$$r \le C_3\,\min\left\{1, \frac{1}{L_1^2}\right\}\frac{\min(d_1,d_2)}{\alpha^2}$$

for a constant $C_3$ that depends only on $c, c'$. Note also that the condition that $f$ and $f'$ be well-behaved 
in the interval $[-1, 1]$ is satisfied for the logistic model with $c = 1/4$ and $c' = \frac{1}{1+e} \le 0.269$. Similarly, we 
may take $c = 0.242$ and $c' = 0.159$ in the probit model. 



4 Proofs of the main results 

In this section we provide the proofs of the main theorems presented in Section 3. To begin, we first define 
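some additional notation that we will need for the proofs. For two probability distributions $\mathcal{P}$ and $\mathcal{Q}$ on a 
finite set $A$, $D(\mathcal{P}\|\mathcal{Q})$ will denote the Kullback-Leibler (KL) divergence, 

$$D(\mathcal{P}\|\mathcal{Q}) = \sum_{x \in A} \mathcal{P}(x)\log\frac{\mathcal{P}(x)}{\mathcal{Q}(x)},$$

where $\mathcal{P}(x)$ denotes the probability of the outcome $x$ under the distribution $\mathcal{P}$. We will abuse this notation 
slightly by overloading it in two ways. First, for scalar inputs $p, q \in [0, 1]$, we will set 

$$D(p\|q) = p\log\left(\frac{p}{q}\right) + (1-p)\log\left(\frac{1-p}{1-q}\right).$$

Second, for two matrices $P, Q \in [0,1]^{d_1 \times d_2}$, we define 

$$D(P\|Q) = \frac{1}{d_1 d_2}\sum_{i,j} D(P_{ij}\|Q_{ij}).$$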

We first prove Theorem 2. Theorem 1 will then follow from an approximation argument. Finally, our lower 
bounds will be proved in Section 4.3 using information theoretic arguments. 

4.1 Proof of Theorem 2 

We will actually prove a slightly more general statement, which will be helpful in the proof of Theorem 1. 
We will assume that $\|M\|_\infty \le \gamma$, and we will modify the program (4) to enforce $\|X\|_\infty \le \gamma$. That is, we 
will consider the program 

$$\widehat{M} = \arg\max_X F_{\Omega,Y}(X) \quad \text{subject to} \quad \|X\|_* \le \alpha\sqrt{r d_1 d_2} \ \text{ and } \ \|X\|_\infty \le \gamma. \tag{15}$$

We will then send $\gamma \to \infty$ to recover the statement of Theorem 2. Formally, we prove the following theorem. 

Theorem 6. Assume that $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$ and $\|M\|_\infty \le \gamma$. Suppose that $\Omega$ is chosen at random 
following the binomial model of Section 2.1 and satisfying $\mathbb{E}|\Omega| = m$. Suppose that $Y$ is generated as in (2), 
and let $L_\gamma$ be as in (5). Let $\widehat{M}$ be the solution to (15). Then, with probability at least $1 - C_1/(d_1 + d_2)$, 

$$d_H^2(f(\widehat{M}), f(M)) \le C_2 L_\gamma \alpha \sqrt{\frac{r(d_1+d_2)}{m}}\sqrt{1 + \frac{(d_1+d_2)\log(d_1 d_2)}{m}}. \tag{16}$$

Above, $C_1$ and $C_2$ are absolute constants. 

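The key to proving Theorem 6 will be to establish the following concentration inequality. 

Lemma 1. Let $G \subset \mathbb{R}^{d_1 \times d_2}$ be 

$$G = \left\{ X \in \mathbb{R}^{d_1 \times d_2} : \|X\|_* \le \alpha\sqrt{r d_1 d_2} \right\}$$

for some $r \le \min\{d_1, d_2\}$ and $\alpha \ge 0$. Then 

$$\mathbb{P}\left( \sup_{X \in G} \left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right| \ge C_0\,\alpha L_\gamma \sqrt{r}\sqrt{m(d_1+d_2) + d_1 d_2 \log(d_1 d_2)} \right) \le \frac{C_1}{d_1 + d_2}, \tag{17}$$

where $C_0$ and $C_1$ are absolute constants and the probability and the expectation are both over the choice of 
$\Omega$ and the draw of $Y$. 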

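We will prove this lemma below, but first we show how it implies Theorem 6. To begin, notice that for 
any choice of $X$, 

$$\mathbb{E}\left[F_{\Omega,Y}(X) - F_{\Omega,Y}(M)\right] = \frac{m}{d_1 d_2}\sum_{i,j}\left( f(M_{ij})\log\left(\frac{f(X_{ij})}{f(M_{ij})}\right) + (1 - f(M_{ij}))\log\left(\frac{1 - f(X_{ij})}{1 - f(M_{ij})}\right) \right) = -m\,D(f(M)\|f(X)),$$

where the expectation is over both $\Omega$ and $Y$. Next, note that by assumption $M \in G$. Moreover, from the 
definition of $\widehat{M}$ we also have that $\widehat{M} \in G$ and $F_{\Omega,Y}(\widehat{M}) \ge F_{\Omega,Y}(M)$. Thus, we can write 

$$\begin{aligned}
0 &\le F_{\Omega,Y}(\widehat{M}) - F_{\Omega,Y}(M) \\
&= F_{\Omega,Y}(\widehat{M}) + \mathbb{E}F_{\Omega,Y}(\widehat{M}) - \mathbb{E}F_{\Omega,Y}(\widehat{M}) + \mathbb{E}F_{\Omega,Y}(M) - \mathbb{E}F_{\Omega,Y}(M) - F_{\Omega,Y}(M) \\
&\le \mathbb{E}\left[F_{\Omega,Y}(\widehat{M}) - F_{\Omega,Y}(M)\right] + \left|F_{\Omega,Y}(\widehat{M}) - \mathbb{E}F_{\Omega,Y}(\widehat{M})\right| + \left|F_{\Omega,Y}(M) - \mathbb{E}F_{\Omega,Y}(M)\right| \\
&\le -m\,D(f(M)\|f(\widehat{M})) + 2\sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right|.
\end{aligned}$$

Applying Lemma 1, we obtain that with probability at least $1 - C_1/(d_1 + d_2)$, we have 

$$0 \le -m\,D(f(M)\|f(\widehat{M})) + 2C_0\,\alpha L_\gamma\sqrt{r}\sqrt{m(d_1+d_2) + d_1 d_2\log(d_1 d_2)}.$$

In this case, by rearranging and applying the fact that $\sqrt{d_1 d_2} \le d_1 + d_2$, we obtain 

$$D(f(M)\|f(\widehat{M})) \le 2C_0\,\alpha L_\gamma\sqrt{\frac{r(d_1+d_2)}{m}}\sqrt{1 + \frac{(d_1+d_2)\log(d_1 d_2)}{m}}. \tag{18}$$

Finally, we note that the KL divergence can easily be bounded below by the Hellinger distance: 

$$d_H^2(p,q) \le D(p\|q).$$

This is a simple consequence of Jensen's inequality combined with the fact that $1 - x \le -\log x$. Thus, 
from (18) we obtain 

$$d_H^2(f(M), f(\widehat{M})) \le 2C_0\,\alpha L_\gamma\sqrt{\frac{r(d_1+d_2)}{m}}\sqrt{1 + \frac{(d_1+d_2)\log(d_1 d_2)}{m}},$$

which establishes Theorem 6. Theorem 2 then follows by taking the limit as $\gamma \to \infty$. 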
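
The divergences used in this proof can be evaluated numerically with a short sketch. It assumes the scalar Hellinger distance $d_H^2(p,q) = (\sqrt{p}-\sqrt{q})^2 + (\sqrt{1-p}-\sqrt{1-q})^2$, averaged over entries for matrices in the same way as the KL divergence; that definition is not restated in this section, so it is an assumption here. The function names and the toy matrices are illustrative.

```python
import math

def kl_scalar(p, q):
    """D(p||q) for Bernoulli parameters p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def hellinger_sq_scalar(p, q):
    """Assumed definition: squared Hellinger distance between Bernoulli(p) and Bernoulli(q)."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1.0 - p) - math.sqrt(1.0 - q)) ** 2

def entrywise_average(fn, P, Q):
    """Average fn(P_ij, Q_ij) over all entries, i.e. the 1/(d1*d2) normalization."""
    d1, d2 = len(P), len(P[0])
    total = sum(fn(P[i][j], Q[i][j]) for i in range(d1) for j in range(d2))
    return total / (d1 * d2)

# Toy probability matrices (hypothetical values).
P = [[0.2, 0.7], [0.5, 0.9]]
Q = [[0.3, 0.6], [0.4, 0.8]]
kl_PQ = entrywise_average(kl_scalar, P, Q)
hel_PQ = entrywise_average(hellinger_sq_scalar, P, Q)
```

On these inputs `hel_PQ` does not exceed `kl_PQ`, in line with the inequality $d_H^2(p,q) \le D(p\|q)$ used to pass from (18) to the bound of Theorem 6.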

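Proof of Lemma 1. We begin by noting that for any $h > 0$, by using Markov's inequality we have that 

$$\begin{aligned}
&\mathbb{P}\left( \sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right| \ge C_0\,\alpha L_\gamma\sqrt{r}\sqrt{m(d_1+d_2) + d_1 d_2\log(d_1 d_2)} \right) \\
&\quad = \mathbb{P}\left( \sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right|^h \ge \left(C_0\,\alpha L_\gamma\sqrt{r}\sqrt{m(d_1+d_2) + d_1 d_2\log(d_1 d_2)}\right)^h \right) \\
&\quad \le \frac{\mathbb{E}\left[\sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right|^h\right]}{\left(C_0\,\alpha L_\gamma\sqrt{r}\sqrt{m(d_1+d_2) + d_1 d_2\log(d_1 d_2)}\right)^h}.
\end{aligned} \tag{19}$$

The bound in (17) will follow by combining this with an upper bound on $\mathbb{E}\left[\sup_{X \in G}|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)|^h\right]$ 
and setting $h = \log(d_1 + d_2)$. Towards this end, note that we can write the definition of $F_{\Omega,Y}$ as 

$$F_{\Omega,Y}(X) = \sum_{i,j} \mathbb{1}_{[(i,j) \in \Omega]}\left( \mathbb{1}_{[Y_{ij} = 1]}\log(f(X_{ij})) + \mathbb{1}_{[Y_{ij} = -1]}\log(1 - f(X_{ij})) \right).$$

By a symmetrization argument (Lemma 6.3 in [31]), 

$$\mathbb{E}\left[\sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right|^h\right] \le 2^h\,\mathbb{E}\left[\sup_{X \in G}\left|\sum_{i,j}\varepsilon_{ij}\mathbb{1}_{[(i,j) \in \Omega]}\left( \mathbb{1}_{[Y_{ij} = 1]}\log(f(X_{ij})) + \mathbb{1}_{[Y_{ij} = -1]}\log(1 - f(X_{ij})) \right)\right|^h\right],$$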

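where the $\varepsilon_{ij}$ are i.i.d. Rademacher random variables and the expectation in the upper bound is with 
respect to both $\Omega$ and $Y$ as well as with respect to the $\varepsilon_{ij}$. To bound the latter term, we apply a 
contraction principle (Theorem 4.12 in [31]). By the definition of $L_\gamma$ and the assumption that $\|M\|_\infty \le \gamma$, 
both $\log(f(x))/L_\gamma$ and $\log(1 - f(x))/L_\gamma$ are contractions. Thus, up to a factor of 2, the expected value 
of the supremum can only decrease when $\log(f(X_{ij}))$ is replaced by $X_{ij}$ and similarly $\log(1 - f(X_{ij}))$ by 
$-X_{ij}$. Thus we obtain 

$$\begin{aligned}
\mathbb{E}\left[\sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right|^h\right] &\le 2^h (2L_\gamma)^h\,\mathbb{E}\left[\sup_{X \in G}\left|\sum_{i,j}\varepsilon_{ij}\mathbb{1}_{[(i,j) \in \Omega]}\left( \mathbb{1}_{[Y_{ij} = 1]}X_{ij} - \mathbb{1}_{[Y_{ij} = -1]}X_{ij} \right)\right|^h\right] \\
&= (4L_\gamma)^h\,\mathbb{E}\left[\sup_{X \in G}\left|\langle \Delta_\Omega \circ E \circ Y, X\rangle\right|^h\right],
\end{aligned} \tag{20}$$

where $E$ denotes the matrix with entries given by $\varepsilon_{ij}$, $\Delta_\Omega$ denotes the indicator matrix for $\Omega$ (so that 
$[\Delta_\Omega]_{ij} = 1$ if $(i,j) \in \Omega$ and $0$ otherwise), and $\circ$ denotes the Hadamard product. Using the facts that the 
distribution of $E \circ Y$ is the same as the distribution of $E$ and that $|\langle A, B\rangle| \le \|A\|\,\|B\|_*$, we have that 

$$\begin{aligned}
\mathbb{E}\left[\sup_{X \in G}\left|\langle \Delta_\Omega \circ E \circ Y, X\rangle\right|^h\right] &= \mathbb{E}\left[\sup_{X \in G}\left|\langle E \circ \Delta_\Omega, X\rangle\right|^h\right] \\
&\le \mathbb{E}\left[\sup_{X \in G}\|E \circ \Delta_\Omega\|^h\,\|X\|_*^h\right] \\
&= \left(\alpha\sqrt{d_1 d_2 r}\right)^h\,\mathbb{E}\left[\|E \circ \Delta_\Omega\|^h\right].
\end{aligned} \tag{21}$$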



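To bound $\mathbb{E}\left[\|E \circ \Delta_\Omega\|^h\right]$, observe that $E \circ \Delta_\Omega$ is a matrix with i.i.d. zero mean entries and thus by 
Theorem 1.1 of [43], 

$$\mathbb{E}\left[\|E \circ \Delta_\Omega\|^h\right] \le C\left( \mathbb{E}\left[\max_{1 \le i \le d_1}\left(\sum_{j=1}^{d_2}\Delta_{ij}\right)^{h/2}\right] + \mathbb{E}\left[\max_{1 \le j \le d_2}\left(\sum_{i=1}^{d_1}\Delta_{ij}\right)^{h/2}\right] \right)$$

for some constant $C$, where $\Delta_{ij} = [\Delta_\Omega]_{ij}$. This in turn implies that 

$$\left(\mathbb{E}\left[\|E \circ \Delta_\Omega\|^h\right]\right)^{1/h} \le C^{1/h}\left( \left(\mathbb{E}\left[\max_{1 \le i \le d_1}\left(\sum_{j=1}^{d_2}\Delta_{ij}\right)^{h/2}\right]\right)^{1/h} + \left(\mathbb{E}\left[\max_{1 \le j \le d_2}\left(\sum_{i=1}^{d_1}\Delta_{ij}\right)^{h/2}\right]\right)^{1/h} \right). \tag{22}$$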



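We first focus on the row sum $\sum_{j=1}^{d_2}\Delta_{ij}$ for a particular choice of $i$. Using Bernstein's inequality, for all 
$t > 0$ we have 

$$\mathbb{P}\left( \left|\sum_{j=1}^{d_2}\left(\Delta_{ij} - \frac{m}{d_1 d_2}\right)\right| > t \right) \le 2\exp\left( \frac{-t^2/2}{m/d_1 + t/3} \right).$$

In particular, if we set $t \ge 6m/d_1$, then $m/d_1 + t/3 \le t/2$, and so for each $i$ we have 

$$\mathbb{P}\left( \left|\sum_{j=1}^{d_2}\left(\Delta_{ij} - \frac{m}{d_1 d_2}\right)\right| > t \right) \le 2\exp(-t) = 2\,\mathbb{P}(W_i > t), \tag{23}$$

where $W_1, \ldots, W_{d_1}$ are i.i.d. exponential random variables. 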
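
The role of the threshold $t \ge 6m/d_1$ in (23) can be checked numerically. The sketch below evaluates the exponent $(t^2/2)/(m/d_1 + t/3)$ from the Bernstein bound as reconstructed above; the function name and the values of `m` and `d1` are hypothetical.

```python
def bernstein_exponent(t, m, d1):
    """Exponent (t^2 / 2) / (m / d1 + t / 3) appearing in the Bernstein tail bound."""
    return (t * t / 2.0) / (m / d1 + t / 3.0)

# Hypothetical problem sizes.
m, d1 = 500, 20
threshold = 6.0 * m / d1
checks = [(t, bernstein_exponent(t, m, d1))
          for t in (threshold, 2 * threshold, 10 * threshold)]
below = bernstein_exponent(threshold / 2, m, d1)
```

For every `t` at or above the threshold the exponent is at least `t`, so the tail is dominated by $2e^{-t}$; below the threshold (as in `below`) the exponent can be smaller than `t`, which is why the proof treats the range $t < 6m/d_1$ separately.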

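Below we use the fact that for any positive random variable $q$ we can write $\mathbb{E}\,q = \int_0^\infty \mathbb{P}(q \ge t)\,dt$, allowing 
us to bound 

$$\begin{aligned}
\left(\mathbb{E}\left[\max_{1 \le i \le d_1}\left(\sum_{j=1}^{d_2}\Delta_{ij}\right)^{h/2}\right]\right)^{1/h}
&\le \sqrt{\frac{m}{d_1}} + \left(\mathbb{E}\left[\max_{1 \le i \le d_1}\left|\sum_{j=1}^{d_2}\left(\Delta_{ij} - \frac{m}{d_1 d_2}\right)\right|^{h/2}\right]\right)^{1/h} \\
&\le \sqrt{\frac{m}{d_1}} + \left(\mathbb{E}\left[\max_{1 \le i \le d_1}\left|\sum_{j=1}^{d_2}\left(\Delta_{ij} - \frac{m}{d_1 d_2}\right)\right|^{h}\right]\right)^{\frac{1}{2h}} \\
&= \sqrt{\frac{m}{d_1}} + \left(\int_0^\infty \mathbb{P}\left(\max_{1 \le i \le d_1}\left|\sum_{j=1}^{d_2}\left(\Delta_{ij} - \frac{m}{d_1 d_2}\right)\right|^{h} \ge t\right)dt\right)^{\frac{1}{2h}} \\
&\le \sqrt{\frac{m}{d_1}} + \left(\left(\frac{6m}{d_1}\right)^h + \int_{(6m/d_1)^h}^\infty \mathbb{P}\left(\max_{1 \le i \le d_1}\left|\sum_{j=1}^{d_2}\left(\Delta_{ij} - \frac{m}{d_1 d_2}\right)\right|^{h} \ge t\right)dt\right)^{\frac{1}{2h}} \\
&\le \sqrt{\frac{m}{d_1}} + \left(\left(\frac{6m}{d_1}\right)^h + 2\int_{(6m/d_1)^h}^\infty \mathbb{P}\left(\max_{1 \le i \le d_1} W_i^h \ge t\right)dt\right)^{\frac{1}{2h}} \\
&\le \sqrt{\frac{m}{d_1}} + \left(\left(\frac{6m}{d_1}\right)^h + 2\,\mathbb{E}\left[\max_{1 \le i \le d_1} W_i^h\right]\right)^{\frac{1}{2h}}.
\end{aligned}$$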

Above, we have used the triangle inequality in the first line, followed by Jensen's inequality in the second 
line. In the fifth line, (23), along with independence, allows us to introduce $\max_i W_i$. By standard 
computations for exponential random variables, 

$$\mathbb{E}\left[\max_{1 \le i \le d_1} W_i^h\right] \le \mathbb{E}\left[\left|\max_{1 \le i \le d_1} W_i - \log d_1\right|^h\right] + \log^h(d_1) \le 2\,h! + \log^h(d_1).$$



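Thus, we have 

$$\begin{aligned}
\left(\mathbb{E}\left[\max_{1 \le i \le d_1}\left(\sum_{j=1}^{d_2}\Delta_{ij}\right)^{h/2}\right]\right)^{1/h}
&\le \sqrt{\frac{m}{d_1}} + \left(\left(\frac{6m}{d_1}\right)^h + 2\left(\log^h(d_1) + 2\,(h!)\right)\right)^{\frac{1}{2h}} \\
&\le (1 + \sqrt{6})\sqrt{\frac{m}{d_1}} + 2^{\frac{1}{2h}}\sqrt{\log d_1} + 2^{\frac{1}{h}}\sqrt{h} \\
&\le (1 + \sqrt{6})\sqrt{\frac{m}{d_1}} + (2 + \sqrt{2})\sqrt{\log(d_1 + d_2)},
\end{aligned}$$

using the choice $h = \log(d_1 + d_2) \ge 1$ in the final line. 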

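A similar argument bounds the column sums, and thus from (22) we conclude that 

$$\begin{aligned}
\left(\mathbb{E}\left[\|E \circ \Delta_\Omega\|^h\right]\right)^{1/h}
&\le C^{1/h}\left( (1 + \sqrt{6})\left(\sqrt{\frac{m}{d_1}} + \sqrt{\frac{m}{d_2}}\right) + 2(2 + \sqrt{2})\sqrt{\log(d_1 + d_2)} \right) \\
&\le C^{1/h}\left( (1 + \sqrt{6})\sqrt{\frac{2m(d_1 + d_2)}{d_1 d_2}} + 2(2 + \sqrt{2})\sqrt{\log(d_1 + d_2)} \right) \\
&\le C^{1/h}\,2(1 + \sqrt{6})\sqrt{\frac{m(d_1 + d_2) + d_1 d_2\log(d_1 + d_2)}{d_1 d_2}},
\end{aligned}$$

where the second and third inequalities both follow from Jensen's inequality. Combining this with (20) 
and (21), we obtain 

$$\mathbb{E}\left[\sup_{X \in G}\left|F_{\Omega,Y}(X) - \mathbb{E}F_{\Omega,Y}(X)\right|^h\right] \le C\left( 8(1 + \sqrt{6})\,\alpha L_\gamma\sqrt{r}\sqrt{m(d_1 + d_2) + d_1 d_2\log(d_1 + d_2)} \right)^h.$$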



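Plugging this into (19) we obtain that the probability in (19) is upper bounded by 

$$C\left(\frac{8(1 + \sqrt{6})}{C_0}\right)^{\log(d_1 + d_2)} \le \frac{C}{d_1 + d_2},$$

provided that $C_0 \ge 8(1 + \sqrt{6})\,e$ (so that the base of the power is at most $1/e$), which establishes the lemma. □ 

4.2 Proof of Theorem 1 

The proof of Theorem 1 follows immediately from Theorem 6 (with $\gamma = \alpha$) combined with the following 
lemma. 

Lemma 2. Let $f$ be a differentiable function and let $\|M\|_\infty, \|\widehat{M}\|_\infty \le \alpha$. Then 

$$d_H^2(f(M), f(\widehat{M})) \ge \inf_{|\xi| \le \alpha}\frac{(f'(\xi))^2}{8 f(\xi)(1 - f(\xi))}\cdot\frac{\|M - \widehat{M}\|_F^2}{d_1 d_2}.$$

Proof. For any pair of entries $x = \widehat{M}_{ij}$ and $y = M_{ij}$, write 

$$\left(\sqrt{f(x)} - \sqrt{f(y)}\right)^2 + \left(\sqrt{1 - f(x)} - \sqrt{1 - f(y)}\right)^2 \ge \frac{1}{2}\left( \left(\sqrt{f(x)} - \sqrt{1 - f(x)}\right) - \left(\sqrt{f(y)} - \sqrt{1 - f(y)}\right) \right)^2.$$

Using Taylor's theorem to expand the quantity inside the square, for some $\xi$ between $x$ and $y$, 

$$\begin{aligned}
\left(\sqrt{f(x)} - \sqrt{f(y)}\right)^2 + \left(\sqrt{1 - f(x)} - \sqrt{1 - f(y)}\right)^2
&\ge \frac{1}{2}\left( f'(\xi)(y - x)\left(\frac{1}{2\sqrt{f(\xi)}} + \frac{1}{2\sqrt{1 - f(\xi)}}\right) \right)^2 \\
&\ge \frac{1}{8}(f'(\xi))^2 (y - x)^2\left(\frac{1}{f(\xi)} + \frac{1}{1 - f(\xi)}\right) \\
&= \frac{(f'(\xi))^2}{8 f(\xi)(1 - f(\xi))}(y - x)^2.
\end{aligned}$$

The lemma follows by summing across all entries and dividing by $d_1 d_2$. □ 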
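
As a sanity check on Lemma 2, the following sketch verifies the scalar inequality on a grid for the logistic function, for which $f' = f(1-f)$ and hence $(f')^2/(8f(1-f)) = f(1-f)/8$, minimized over $|\xi| \le \alpha$ at $\xi = \pm\alpha$. The Hellinger definition is the one assumed from the earlier sections; the function names, the value of `alpha`, and the grid are illustrative.

```python
import math

def logistic(x):
    """Logistic link f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def hellinger_sq(p, q):
    """Assumed scalar definition of the squared Hellinger distance."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1.0 - p) - math.sqrt(1.0 - q)) ** 2

def lemma2_constant_logistic(alpha):
    """inf over |xi| <= alpha of (f')^2 / (8 f (1 - f)) for the logistic f."""
    p = logistic(alpha)
    return p * (1.0 - p) / 8.0

alpha = 2.0  # hypothetical bound on the entries
grid = [-alpha + k * (2 * alpha / 40) for k in range(41)]
constant = lemma2_constant_logistic(alpha)
worst_gap = min(
    hellinger_sq(logistic(x), logistic(y)) - constant * (x - y) ** 2
    for x in grid for y in grid
)
```

A nonnegative `worst_gap` (up to rounding) means the entrywise inequality from the proof holds at every grid pair; averaging over entries then gives the matrix statement.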
4.3 Lower bounds 

The proofs of our lower bounds each follow a similar outline, using classical information theoretic techniques 
that have also proven useful in the context of compressed sensing [8, 39]. At a high level, our argument 
involves first showing the existence of a set $\mathcal{X}$ of matrices, so that for each $X^{(i)} \ne X^{(j)} \in \mathcal{X}$, $\|X^{(i)} - X^{(j)}\|_F$ 
is large. We will imagine obtaining measurements of a randomly chosen matrix in $\mathcal{X}$ and then running 
an arbitrary recovery procedure. If the recovered matrix is sufficiently close to the original matrix, then 
we could determine which element of $\mathcal{X}$ was chosen. However, Fano's inequality will imply that the 
probability of correctly identifying the chosen matrix is small, which will induce a lower bound on how 
close the recovered matrix can be to the original matrix. 
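
The Fano step in this outline can be sketched numerically, using the form that appears as (29) in the proof of Theorem 3, $\mathbb{P}(X \ne X^*) \ge 1 - (\max D + 1)/\log|\mathcal{X}|$, together with the packing size $\log|\mathcal{X}| = rd_2/(16\gamma^2)$ from Lemma 3. The function name and all numeric values below are hypothetical.

```python
def fano_error_lower_bound(max_kl, log_cardinality):
    """Lower bound 1 - (max_kl + 1) / log|X| on the identification error, clipped at 0.

    max_kl is the largest pairwise KL divergence between the observation
    distributions; log_cardinality is the natural log of the packing size.
    """
    return max(0.0, 1.0 - (max_kl + 1.0) / log_cardinality)

# Hypothetical values: r = 4, d2 = 100, gamma = 0.5.
r, d2, gamma = 4, 100, 0.5
log_card = r * d2 / (16.0 * gamma ** 2)  # log of the packing size from Lemma 3
bound = fano_error_lower_bound(max_kl=24.0, log_cardinality=log_card)
```

Working with `log_cardinality` directly avoids forming the exponentially large packing size. The bound is informative only when the pairwise divergence is small relative to the log of the packing size, which is the trade-off each lower-bound proof balances through its choice of $\epsilon$ and $\gamma$.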

In the proofs of Theorems 3, 4, and 5, we will assume without loss of generality that $d_2 \ge d_1$. Before 
providing these proofs, however, we first consider the construction of the set $\mathcal{X}$. 



4.3.1 Packing set construction 

Lemma 3. Let $K$ be defined as in (11), let $\gamma \le 1$ be such that $r/\gamma^2$ is an integer, and suppose that $r/\gamma^2 \le d_1$. 
There is a set $\mathcal{X} \subset K$ with 

$$|\mathcal{X}| \ge \exp\left(\frac{r d_2}{16\gamma^2}\right)$$

with the following properties: 

1. For all $X \in \mathcal{X}$, each entry has $|X_{ij}| = \alpha\gamma$. 

2. For all $X^{(i)}, X^{(j)} \in \mathcal{X}$, $i \ne j$, 

$$\|X^{(i)} - X^{(j)}\|_F^2 > \frac{\alpha^2\gamma^2 d_1 d_2}{2}.$$

Proof. We use a probabilistic argument. The set $\mathcal{X}$ will be constructed by drawing 

$$|\mathcal{X}| = \left\lceil \exp\left(\frac{r d_2}{16\gamma^2}\right) \right\rceil \tag{24}$$

matrices $X$ independently from the following distribution. Set $B = r/\gamma^2$. The matrix will consist of blocks 
of dimensions $B \times d_2$, stacked on top of each other. The entries of the first block (that is, $X_{ij}$ for 
$(i,j) \in [B] \times [d_2]$) will be i.i.d. symmetric random variables with values $\pm\alpha\gamma$. Then $X$ will be filled out 
by copying this block as many times as will fit. That is, 

$$X_{ij} := X_{i'j} \quad \text{where } i' = i \ (\mathrm{mod}\ B) + 1.$$

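Now we argue that with nonzero probability, this set will have all the desired properties. For $X \in \mathcal{X}$, 

$$\|X\|_\infty = \alpha\gamma \le \alpha.$$

Further, because $\operatorname{rank} X \le B$, 

$$\|X\|_* \le \sqrt{B}\,\|X\|_F = \sqrt{\frac{r}{\gamma^2}}\sqrt{d_1 d_2}\,\alpha\gamma = \alpha\sqrt{r d_1 d_2}.$$

Thus $\mathcal{X} \subset K$, and all that remains is to show that $\mathcal{X}$ satisfies requirement 2. 
For $X, W$ drawn from the above distribution, 

$$\begin{aligned}
\|X - W\|_F^2 &= \sum_{i,j}(X_{ij} - W_{ij})^2 \\
&\ge \left\lfloor \frac{d_1}{B} \right\rfloor \sum_{i \in [B]}\sum_{j \in [d_2]}(X_{ij} - W_{ij})^2 \\
&= 4\alpha^2\gamma^2\left\lfloor \frac{d_1}{B} \right\rfloor \sum_{i \in [B]}\sum_{j \in [d_2]}\delta_{ij} \\
&=: 4\alpha^2\gamma^2\left\lfloor \frac{d_1}{B} \right\rfloor Z(X, W),
\end{aligned}$$

where the $\delta_{ij}$ are independent 0/1 Bernoulli random variables with mean 1/2. By Hoeffding's inequality 
and a union bound, 

$$\mathbb{P}\left( \min_{X \ne W \in \mathcal{X}} Z(X, W) \le \frac{d_2 B}{4} \right) \le \binom{|\mathcal{X}|}{2}\exp\left(-\frac{d_2 B}{8}\right).$$

One can check that for $\mathcal{X}$ of the size given in (24), the right-hand side of the above tail bound is less than 
1, and thus the event that $Z(X, W) > d_2 B/4$ for all $X \ne W \in \mathcal{X}$ has non-zero probability. In this event, 

$$\|X - W\|_F^2 > \alpha^2\gamma^2\left\lfloor \frac{d_1}{B} \right\rfloor d_2 B \ge \frac{\alpha^2\gamma^2 d_1 d_2}{2},$$

where the second inequality uses the assumption that $d_1 \ge B$ and the fact that $\lfloor x \rfloor \ge x/2$ for all $x \ge 1$. 
Hence, requirement 2 holds with nonzero probability and thus the desired set exists. □ 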
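
The random construction in this proof can be sketched in a few lines. The code below draws two matrices from the block distribution (using 0-based indices, so row `i` copies block row `i % B`), then compares their squared Frobenius distance with the lower bound $4\alpha^2\gamma^2\lfloor d_1/B\rfloor Z(X,W)$ from the proof. The function names and the small dimensions are illustrative assumptions.

```python
import random

def draw_packing_matrix(d1, d2, B, scale, rng):
    """Draw a B x d2 block of i.i.d. +/-scale entries and tile it down d1 rows."""
    block = [[rng.choice((-scale, scale)) for _ in range(d2)] for _ in range(B)]
    return [list(block[i % B]) for i in range(d1)]

def frobenius_sq(X, W):
    """Squared Frobenius norm of X - W."""
    return sum((x - w) ** 2 for rx, rw in zip(X, W) for x, w in zip(rx, rw))

# Hypothetical parameters with r / gamma^2 an integer and at most d1.
alpha, gamma, r = 1.0, 0.5, 1
d1, d2 = 10, 6
B = int(r / gamma ** 2)   # block height B = r / gamma^2
scale = alpha * gamma     # every entry has magnitude alpha * gamma
rng = random.Random(0)
X = draw_packing_matrix(d1, d2, B, scale, rng)
W = draw_packing_matrix(d1, d2, B, scale, rng)

# Z(X, W): number of entries of the first block where X and W differ.
Z = sum(1 for i in range(B) for j in range(d2) if X[i][j] != W[i][j])
distance = frobenius_sq(X, W)
lower = 4 * scale ** 2 * (d1 // B) * Z
```

Because each matrix has at most `B` distinct rows, its rank is at most `B`, which is what the nuclear-norm bound in the proof relies on.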

4.3.2 Proof of Theorem 3 

Before we prove Theorem 3, we will need the following lemma about the KL divergence. 

Lemma 4. Suppose that $x, y \in (0, 1)$. Then 

$$D(x\|y) \le \frac{(x - y)^2}{y(1 - y)}.$$

Proof. Without loss of generality, we may assume that $x < y$. Indeed, $D(1 - x\|1 - y) = D(x\|y)$, and 
either $x < y$ or $1 - x < 1 - y$. Let $z = y - x$. A simple computation shows that 

$$\frac{d}{dz}D(x\|x + z) = \frac{z}{(x + z)(1 - x - z)}.$$

Thus, by Taylor's theorem, there is some $\xi \in [0, z]$ so that 

$$D(x\|y) = D(x\|x) + z\left(\frac{\xi}{(x + \xi)(1 - x - \xi)}\right).$$

Since the right hand side is increasing in $\xi$, we may replace $\xi$ with $z$ and conclude 

$$D(x\|y) \le \frac{z^2}{y(1 - y)},$$

as desired. □ 
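
Lemma 4 is easy to check numerically on a grid of Bernoulli parameters. The sketch below compares $D(x\|y)$ with $(x-y)^2/(y(1-y))$; the function names and the grid resolution are illustrative choices.

```python
import math

def kl_scalar(x, y):
    """D(x||y) for Bernoulli parameters x, y in (0, 1)."""
    return x * math.log(x / y) + (1.0 - x) * math.log((1.0 - x) / (1.0 - y))

def lemma4_bound(x, y):
    """Right-hand side of Lemma 4: (x - y)^2 / (y (1 - y))."""
    return (x - y) ** 2 / (y * (1.0 - y))

grid = [k / 50 for k in range(1, 50)]
worst_slack = min(lemma4_bound(x, y) - kl_scalar(x, y) for x in grid for y in grid)
```

A nonnegative `worst_slack` (up to rounding) indicates the bound holds at every grid pair; the slack is zero on the diagonal `x == y`, where both sides vanish.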
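Now, for the proof of Theorem 3, we choose $\epsilon$ so that 

$$\epsilon^2 = \min\left\{ \frac{1}{1024},\; C_2\,\alpha\sqrt{\beta_{3\alpha/4}}\sqrt{\frac{r d_2}{m}} \right\}, \tag{25}$$

where $C_2$ is an absolute constant to be specified later. We will next use Lemma 3 to construct a set $\mathcal{X}$, 
choosing $\gamma$ so that $r/\gamma^2$ is an integer and 

$$\frac{4\sqrt{2}\,\epsilon}{\alpha} \le \gamma \le \frac{8\epsilon}{\alpha}.$$

We can make such a choice because $\epsilon \le 1/32$ and $r \ge 4$. We verify that such a choice for $\gamma$ satisfies the 
requirements of Lemma 3. Indeed, since $\epsilon \le 1/32$ and $\alpha \ge 1$ we have $\gamma \le 1/4 < 1$. Further, we assume in 
the theorem that the right-hand side of (25) is larger than $Cr\alpha^2/d_1$, which implies that $r/\gamma^2 \le d_1$ for an 
appropriate choice of $C$. 

Let $\mathcal{X}'_{\alpha/2,\gamma}$ be the set whose existence is guaranteed by Lemma 3 with this choice of $\gamma$, and with $\alpha/2$ 
instead of $\alpha$. We will construct $\mathcal{X}$ by setting 

$$\mathcal{X} := \left\{ X' + \alpha\left(1 - \frac{\gamma}{2}\right)\mathbf{1} \;:\; X' \in \mathcal{X}'_{\alpha/2,\gamma} \right\},$$

where $\mathbf{1}$ denotes the all-ones matrix. Note that $\mathcal{X}$ has the same size as $\mathcal{X}'_{\alpha/2,\gamma}$, i.e., $|\mathcal{X}|$ satisfies (24). $\mathcal{X}$ also has the same bound on pairwise 
distances 

$$\|X^{(i)} - X^{(j)}\|_F^2 \ge \frac{\alpha^2\gamma^2 d_1 d_2}{8} \ge 4 d_1 d_2\,\epsilon^2, \tag{26}$$

and every entry of $X \in \mathcal{X}$ has 

$$|X_{ij}| \in \{\alpha, \alpha'\},$$

where $\alpha' = (1 - \gamma)\alpha$. Further, $\mathcal{X} \subset K$, since for $X' \in \mathcal{X}'_{\alpha/2,\gamma}$, 

$$\left\|X' + \alpha\left(1 - \frac{\gamma}{2}\right)\mathbf{1}\right\|_* \le \|X'\|_* + \alpha\left(1 - \frac{\gamma}{2}\right)\sqrt{d_1 d_2} \le \alpha\sqrt{r d_1 d_2}$$

for $r \ge 4$ as in the theorem statement. 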

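Now suppose for the sake of a contradiction that there exists an algorithm such that for any $X \in K$, 
when given access to the measurements $Y_\Omega$, returns an $\widehat{X}$ such that 

$$\frac{1}{d_1 d_2}\|X - \widehat{X}\|_F^2 < \epsilon^2 \tag{27}$$

with probability at least 1/4. We will imagine running this algorithm on a matrix $X$ chosen uniformly at 
random from $\mathcal{X}$. Let 

$$X^* = \arg\min_{X^{(i)} \in \mathcal{X}}\|X^{(i)} - \widehat{X}\|_F^2.$$

It is easy to check that if (27) holds, then $X^* = X$. Indeed, for any $X' \in \mathcal{X}$ with $X' \ne X$, from (27) 
and (26) we have that 

$$\|X' - \widehat{X}\|_F = \|X' - X + X - \widehat{X}\|_F \ge \|X' - X\|_F - \|X - \widehat{X}\|_F > 2\sqrt{d_1 d_2}\,\epsilon - \sqrt{d_1 d_2}\,\epsilon = \sqrt{d_1 d_2}\,\epsilon.$$

At the same time, since $X \in \mathcal{X}$ is a candidate for $X^*$, we have that 

$$\|X^* - \widehat{X}\|_F \le \|X - \widehat{X}\|_F < \sqrt{d_1 d_2}\,\epsilon.$$

Thus, if (27) holds, then $\|X^* - \widehat{X}\|_F < \|X' - \widehat{X}\|_F$ for any $X' \in \mathcal{X}$ with $X' \ne X$, and hence we must 
have $X^* = X$. By assumption, (27) holds with probability at least 1/4, and thus 

$$\mathbb{P}(X \ne X^*) \le \frac{3}{4}. \tag{28}$$

We will show that this probability must in fact be large, generating our contradiction. 
By a variant of Fano's inequality 

$$\mathbb{P}(X \ne X^*) \ge 1 - \frac{\max_{X^{(k)} \ne X^{(\ell)}} D\left(Y_\Omega|X^{(k)} \,\|\, Y_\Omega|X^{(\ell)}\right) + 1}{\log|\mathcal{X}|}. \tag{29}$$

Because each entry of $Y$ is independent,³ 

$$D := D\left(Y_\Omega|X^{(k)} \,\|\, Y_\Omega|X^{(\ell)}\right) = \sum_{(i,j) \in \Omega} D\left(Y_{ij}|X^{(k)}_{ij} \,\|\, Y_{ij}|X^{(\ell)}_{ij}\right).$$

Each term in the sum is either $0$, $D(\alpha\|\alpha')$, or $D(\alpha'\|\alpha)$. By Lemma 4, all of these are bounded above by 

$$D\left(Y_{ij}|X^{(k)}_{ij} \,\|\, Y_{ij}|X^{(\ell)}_{ij}\right) \le \frac{(f(\alpha) - f(\alpha'))^2}{f(\alpha')(1 - f(\alpha'))},$$

and so, from the intermediate value theorem, for some $\xi \in [\alpha', \alpha]$, 

$$D \le m\,\frac{(f(\alpha) - f(\alpha'))^2}{f(\alpha')(1 - f(\alpha'))} \le m\,\frac{(f'(\xi))^2(\alpha - \alpha')^2}{f(\alpha')(1 - f(\alpha'))}.$$

Using the assumption that $f'(x)$ is decreasing for $x > 0$ and the definition of $\alpha' = (1 - \gamma)\alpha$, we have 

$$D \le \frac{m(\gamma\alpha)^2}{\beta_{\alpha'}} \le \frac{64 m\epsilon^2}{\beta_{\alpha'}}.$$

Then from (29) and (28), 

$$\frac{1}{4} \le 1 - \mathbb{P}(X \ne X^*) \le \frac{D + 1}{\log|\mathcal{X}|} \le 16\gamma^2\left(\frac{\frac{64m\epsilon^2}{\beta_{\alpha'}} + 1}{r d_2}\right) \le 1024\,\epsilon^2\left(\frac{\frac{64m\epsilon^2}{\beta_{\alpha'}} + 1}{\alpha^2 r d_2}\right). \tag{30}$$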
We now show that for appropriate values of $C_0$ and $C_2$, this leads to a contradiction. First suppose that 
$64m\epsilon^2 \le \beta_{\alpha'}$. In this case we have 

$$\frac{1}{4} \le 1024\,\epsilon^2\,\frac{2}{\alpha^2 r d_2},$$

which together with (25) implies that $\alpha^2 r d_2 \le 8$. If we set $C_0 > 8$ in (10), then this would lead to a 
contradiction. Thus, suppose now that $64m\epsilon^2 > \beta_{\alpha'}$. Then (30) simplifies to 

$$\frac{1}{4} < \frac{1024 \cdot 128 \cdot m\epsilon^4}{\beta_{\alpha'}\,\alpha^2 r d_2}.$$

Thus 

$$\epsilon^2 > \frac{\alpha\sqrt{\beta_{\alpha'}}}{512\sqrt{2}}\sqrt{\frac{r d_2}{m}}.$$

Note $\beta_\alpha$ is increasing as a function of $\alpha$ and $\alpha' \ge 3\alpha/4$ (since $\gamma \le 1/4$). Thus, $\beta_{\alpha'} \ge \beta_{3\alpha/4}$. Setting 
$C_2 \le 1/(512\sqrt{2})$ in (25) now leads to a contradiction, and hence (27) must fail to hold with probability at 
least 3/4, which proves the theorem. 

³ Note that here, to be consistent with the literature we are referencing regarding Fano's inequality, $D$ is defined slightly 
differently than elsewhere in the paper where we would weight $D$ by $1/(d_1 d_2)$. 



4.3.3 Proof of Theorem 4 

Choose $\epsilon$ so that 

$$\epsilon^2 = \min\left\{ \frac{1}{16},\; C_2\,\alpha\sigma\sqrt{\frac{r d_2}{m}} \right\} \tag{31}$$

for an absolute constant $C_2$ to be determined later. As in the proof of Theorem 3, we will consider running 
such an algorithm on a random element in a set $\mathcal{X} \subset K$. For our set $\mathcal{X}$, we will use the set whose existence 
is guaranteed by Lemma 3. We will set $\gamma$ so that $r/\gamma^2$ is an integer and 

$$\frac{2\sqrt{2}\,\epsilon}{\alpha} \le \gamma \le \frac{4\epsilon}{\alpha}.$$

This is possible since $\epsilon \le 1/4$ and $r, \alpha \ge 1$. One can check that $\gamma$ satisfies the assumptions of Lemma 3. 

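Now suppose that $X \in \mathcal{X}$ is chosen uniformly at random, and let $Y = (X + Z)|_\Omega$ as in the statement 
of the theorem. Let $\widehat{X}$ be any estimate of $X$ obtained from $Y_\Omega$. We begin by bounding the mutual 
information $I(X; \widehat{X})$ in the following lemma (which is analogous to [12, Equation 9.16]). 

Lemma 5. 

$$I(X; \widehat{X}) \le \frac{m}{2}\log\left(\frac{\sigma^2 + (\alpha\gamma)^2}{\sigma^2}\right).$$

Proof. We begin by noting that 

$$I(X_\Omega; Y_\Omega) = h(X_\Omega + Z_\Omega) - h(X_\Omega + Z_\Omega | X_\Omega) = h(X_\Omega + Z_\Omega) - h(Z_\Omega),$$

where $h$ denotes the differential entropy. Let $\xi$ denote a matrix of i.i.d. $\pm 1$ entries. Then 

$$h(X_\Omega \circ \xi + Z_\Omega) = h((X_\Omega + Z_\Omega) \circ \xi) \ge h((X_\Omega + Z_\Omega) \circ \xi \,|\, \xi) = h(X_\Omega + Z_\Omega),$$

and so, letting $\widetilde{X} = X \circ \xi$, 

$$I(X_\Omega; Y_\Omega) \le h(\widetilde{X}_\Omega + Z_\Omega) - h(Z_\Omega).$$

Treating $\widetilde{X}_\Omega + Z_\Omega$ as a random vector of length $m$, we compute the covariance matrix as 

$$\Sigma := \mathbb{E}\left[\operatorname{vec}(\widetilde{X}_\Omega + Z_\Omega)\operatorname{vec}(\widetilde{X}_\Omega + Z_\Omega)^T\right] = \left(\sigma^2 + (\alpha\gamma)^2\right) I_m.$$

By Theorem 8.6.5 in [12], 

$$h(\widetilde{X}_\Omega + Z_\Omega) \le \frac{1}{2}\log\left((2\pi e)^m\det(\Sigma)\right) = \frac{1}{2}\log\left((2\pi e)^m\left(\sigma^2 + (\alpha\gamma)^2\right)^m\right).$$

We have that $h(Z_\Omega) = \frac{1}{2}\log\left((2\pi e)^m\sigma^{2m}\right)$, and so 

$$I(X_\Omega; Y_\Omega) \le \frac{m}{2}\log\left(1 + \left(\frac{\alpha\gamma}{\sigma}\right)^2\right).$$

Then the data processing inequality implies 

$$I(X; \widehat{X}) \le \frac{m}{2}\log\left(1 + \left(\frac{\alpha\gamma}{\sigma}\right)^2\right),$$

which establishes the lemma. □ 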
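
The bound of Lemma 5 is the familiar Gaussian-channel expression, and the entropy bookkeeping in the proof can be reproduced numerically. In the sketch below the function names and parameter values are hypothetical; `gaussian_entropy_iid` is the differential entropy of $m$ i.i.d. Gaussian coordinates with the given variance.

```python
import math

def mutual_information_bound(m, alpha, gamma, sigma):
    """(m / 2) * log(1 + (alpha * gamma / sigma)^2), the bound in Lemma 5."""
    return 0.5 * m * math.log1p((alpha * gamma / sigma) ** 2)

def gaussian_entropy_iid(m, variance):
    """Differential entropy (in nats) of m i.i.d. N(0, variance) coordinates."""
    return 0.5 * m * math.log(2.0 * math.pi * math.e * variance)

# Hypothetical values.
m, alpha, gamma, sigma = 200, 1.0, 0.25, 0.5
bound = mutual_information_bound(m, alpha, gamma, sigma)
entropy_gap = (gaussian_entropy_iid(m, sigma ** 2 + (alpha * gamma) ** 2)
               - gaussian_entropy_iid(m, sigma ** 2))
linearized = 0.5 * m * (alpha * gamma / sigma) ** 2  # uses log(1 + z) <= z
```

Here `bound` coincides with `entropy_gap`, the difference of the two entropies in the proof, and is at most `linearized`, the relaxation via $\log(1+z) \le z$ that the remainder of the argument applies.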

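We now proceed by using essentially the same argument as in the proof of Theorem 3. Specifically, 
we suppose for the sake of a contradiction that there exists an algorithm such that for any $X \in K$, when 
given access to the measurements $Y_\Omega$, returns an $\widehat{X}$ such that 

$$\frac{1}{d_1 d_2}\|X - \widehat{X}\|_F^2 < \epsilon^2 \tag{32}$$

with probability at least 1/4. As before, if we set 

$$X^* = \arg\min_{X^{(i)} \in \mathcal{X}}\|X^{(i)} - \widehat{X}\|_F^2$$

then we can show that if (32) holds, then $X^* = X$. Thus, if (32) holds with probability at least 1/4 then 

$$\mathbb{P}(X \ne X^*) \le \frac{3}{4}. \tag{33}$$

However, by Fano's inequality, the probability that $X \ne X^*$ is at least 

$$\mathbb{P}(X \ne X^*) \ge \frac{H(X|\widehat{X}) - 1}{\log|\mathcal{X}|} = \frac{H(X) - I(X; \widehat{X}) - 1}{\log|\mathcal{X}|} = 1 - \frac{I(X; \widehat{X}) + 1}{\log|\mathcal{X}|}.$$

Plugging in $|\mathcal{X}|$ from Lemma 3 and $I(X; \widehat{X})$ from Lemma 5, and using the inequality $\log(1 + z) \le z$, we 
obtain 

$$\mathbb{P}(X \ne X^*) \ge 1 - \frac{16\gamma^2}{r d_2}\left(\frac{m}{2}\left(\frac{\alpha\gamma}{\sigma}\right)^2 + 1\right).$$

Combining this with (33) and using the fact that $\gamma \le 4\epsilon/\alpha$, we obtain 

$$\frac{1}{4} \le \frac{256\,\epsilon^2}{\alpha^2 r d_2}\left(8m\left(\frac{\epsilon}{\sigma}\right)^2 + 1\right).$$

We now argue, as before, that this leads to a contradiction. Specifically, if $8m\epsilon^2/\sigma^2 \le 1$, then together 
with (31) this implies that $\alpha^2 r d_2 \le 128$. If we set $C_0 > 128$ in (10), then this would lead to a contradiction. 
Thus, suppose now that $8m\epsilon^2/\sigma^2 > 1$, in which case we have 

$$\epsilon^2 > \frac{\alpha\sigma}{128}\sqrt{\frac{r d_2}{m}}.$$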

Thus, setting $C_2 \le 1/128$ in (31) leads to a contradiction, and hence (32) must fail to hold with probability 
at least 3/4, which proves the theorem. 

4.3.4 Proof of Theorem 5 

The proof of Theorem 5 also mirrors the proof of Theorem 3. The main difference is the observation that 
the set constructed in Lemma 3 also works with the Hellinger distance. We begin as before by choosing $\epsilon$ 
so that 

$$\epsilon^2 = \min\left\{ C_1,\; C_2\,\frac{\alpha}{L_1}\sqrt{\frac{r d_2}{m}} \right\}, \tag{34}$$

where $C_1$ is small enough that $\epsilon \le c/4$ and $C_2$ is a constant to be determined. Choose $\gamma$ so that $r/\gamma^2$ is an integer and 

$$\frac{2\sqrt{2}\,\epsilon}{\alpha c} \le \gamma \le \frac{4\epsilon}{\alpha c}.$$

This is possible since by assumption $\alpha \ge 1$ and $\epsilon \le c/4$. One can check that $\gamma$ satisfies the assumptions of 
Lemma 3. 

As in the proof of Theorem 3, we will consider running such an algorithm on a random element in a 
set $\mathcal{X} \subset K$. For our set $\mathcal{X}$, we will use the set whose existence is guaranteed by Lemma 3. Note that since 
the Hellinger distance is bounded below by the Frobenius norm, we have that for all $X^{(i)} \ne X^{(j)} \in \mathcal{X}$, 

$$d_H^2(f(X^{(i)}), f(X^{(j)})) \ge c^2\|X^{(i)} - X^{(j)}\|_F^2 > \frac{c^2}{2}\alpha^2\gamma^2 d_1 d_2 \ge 4 d_1 d_2\,\epsilon^2.$$

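Now suppose for the sake of a contradiction that there exists an algorithm such that for any $X \in K$, when 
given access to the measurements $Y_\Omega$, returns an $\widehat{X}$ such that 

$$d_H^2(f(X), f(\widehat{X})) < \epsilon^2 \tag{35}$$

with probability at least 1/4. If we set 

$$X^* = \arg\min_{X^{(i)} \in \mathcal{X}} d_H^2(f(X^{(i)}), f(\widehat{X}))$$

then we can show that if (35) holds, then $X^* = X$. Thus, if (35) holds with probability at least 1/4 then 

$$\mathbb{P}(X \ne X^*) \le \frac{3}{4}. \tag{36}$$

However, we may again apply Fano's inequality as in (29). Using Lemma 4 we have 

$$D\left(Y_{ij}|X^{(k)}_{ij} \,\|\, Y_{ij}|X^{(\ell)}_{ij}\right) \le \frac{(f(\alpha\gamma) - f(-\alpha\gamma))^2}{f(\alpha\gamma)(1 - f(\alpha\gamma))} \le \frac{4(f'(\xi))^2\alpha^2\gamma^2}{f(\alpha\gamma)(1 - f(\alpha\gamma))} \le \frac{4 f^2(\xi) L_{\alpha\gamma}^2\,\alpha^2\gamma^2}{f(\alpha\gamma)(1 - f(\alpha\gamma))}$$

for some $|\xi| \le \alpha\gamma$, where $L_{\alpha\gamma}$ is as in (5). By the assumption that $c' < |f(x)| < 1 - c'$ for $|x| < 1$, and that 

$$\alpha\gamma \le \alpha\left(\frac{4\epsilon}{\alpha c}\right) = \frac{4\epsilon}{c} \le 1,$$

we obtain 

$$D\left(Y_{ij}|X^{(k)}_{ij} \,\|\, Y_{ij}|X^{(\ell)}_{ij}\right) \le C' L_1^2\,\epsilon^2,$$

where $C' = 64c'/(c^2(1 - c'))$. Thus, from (29), we have 

$$\frac{1}{4} \le \frac{C' m L_1^2\epsilon^2 + 1}{\log|\mathcal{X}|} \le \frac{256\,\epsilon^2}{c^2}\left(\frac{C' m L_1^2\epsilon^2 + 1}{\alpha^2 r d_2}\right).$$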

We now argue once again that this leads to a contradiction. Specifically, if $C' m L_1^2\epsilon^2 \le 1$, then together 
with (34) this implies that $\alpha^2 r d_2 \le 128/c$. If we set $C_0 > 128/c$ in (10), then this would lead to a 
contradiction. Thus, suppose now that $C' m L_1^2\epsilon^2 > 1$, in which case we have 

$$\epsilon^2 > \frac{c\,\alpha}{32\sqrt{2C'}\,L_1}\sqrt{\frac{r d_2}{m}}.$$

Thus setting $C_2 \le c/(32\sqrt{2C'})$ in (34) leads to a contradiction, and hence (35) must fail to hold with 
probability at least 3/4, which proves the theorem. 

5 Discussion 



Many of the applications of matrix completion consider discrete data, sometimes consisting of binary 
measurements. This paper addresses such situations. However, matrix completion from noiseless binary 
measurements is extremely ill-posed, even if one collects a binary measurement from all of the matrix entries. 
Fortunately, when there are some stochastic variations (noise) in the problem, matrix reconstruction 
becomes well-posed. We demonstrate that the unknown matrix can be accurately and efficiently recovered 
from binary measurements. When the infinity norm of the unknown matrix is bounded by a constant, we 
show that our error bounds are tight to within a constant and even match what is possible for undiscretized 
data. We also show that the binary probability distribution can be reconstructed over the entire matrix 
without any assumption on the infinity-norm, and we give a matching lower bound (up to a constant). 

Our theory considers approximately low-rank matrices; in particular, we assume that the singular values 
belong to a scaled Schatten-1 ball. It would be interesting to see whether more accurate reconstruction 
could be achieved under the assumption that the unknown matrix has precisely $r$ nonzero singular values. 
It would also be interesting to study whether our ideas can be extended to deal with measurements that 
are quantized to more than 2 (but still a small number of) different values. 